How to use imblearn undersampler in pipeline?

I have the following pipeline construction:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

sel = SelectKBest(k='all', score_func=chi2)
under = RandomUnderSampler(sampling_strategy=0.2)
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])
final_pipe = Pipeline(steps=[('sample', under), ('preprocessor', preprocessor), ('var', VarianceThreshold()), ('sel', sel), ('clf', model)])
However, I get the following error:
TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample (but not both) or be a string 'passthrough' '<class 'sklearn.compose._column_transformer.make_column_selector'>' (type <class 'type'>) doesn't)
I don't understand what I am doing wrong. Can anybody help?

@Math12, I encountered this same issue recently, and the way I got around it is to wrap the RandomUnderSampler() in a custom function, which is then wrapped in a FunctionTransformer.
I have reworked your code to look this way and it worked.
Below is a snippet of the code sample:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

sel = SelectKBest(k='all', score_func=chi2)
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])

def Data_Preprocessing_3(df):
    # fit the random under-sampler on the train data;
    # fit_resample needs the features and the target separately, so this
    # assumes the target is the last column of df (adjust to your data)
    rus = RandomUnderSampler(sampling_strategy=0.2)
    X_res, y_res = rus.fit_resample(df.iloc[:, :-1], df.iloc[:, -1])
    X_res[df.columns[-1]] = y_res
    return X_res

# in a separate code line outside the above function, wrap the function
# with a FunctionTransformer
under = FunctionTransformer(Data_Preprocessing_3)

# implement your pipeline as done initially
final_pipe = Pipeline(steps=[('sample', under), ('preprocessor', preprocessor), ('var', VarianceThreshold()), ('sel', sel), ('clf', model)])

Related

Unable to make prediction with Sklearn model on pyspark dataframe

I have loaded an sklearn model successfully, but I am unable to make predictions on a PySpark dataframe. When running the code given below, I get the error mentioned further down. Please help me get the code to make predictions with the sklearn model on PySpark. I have also searched relevant questions but could not find a solution.
sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
braodcast_model.value

# update prediction method
def predictor(cols):
    # call predict method for model
    return model.value.predict(*cols)

udf_predictor = udf(predictor, FloatType())
# apply the udf to dataframe
df_prediction = df.withColumn("prediction", udf_predictor(df.select(list_of_columns)))
I get the following error message
TypeError: Invalid argument, not a string or column. For column literals, use 'lit', 'array',
'struct' or 'create_map' function.
I think you were on the right track for reaching your expected output.
I managed to find two possible solutions to this problem: one uses a Spark UDF, the other a Pandas UDF.
Spark UDF
from pyspark.sql.functions import udf

@udf('integer')
def predict_udf(*cols):
    return int(braodcast_model.value.predict((cols,)))

list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_udf(*list_of_columns))
Pandas UDF
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('integer')
def predict_pandas_udf(*cols):
    X = pd.concat(cols, axis=1)
    return pd.Series(braodcast_model.value.predict(X))

list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_pandas_udf(*list_of_columns))
Reproducible example
Here I used a Databricks Community cluster with Spark 3.1.2, pandas==1.2.4 and pyarrow==4.0.0.
braodcast_model is a simple logistic regression from scikit-learn, trained on the breast cancer dataset.
import pandas as pd
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from pyspark.sql.functions import udf, pandas_udf
# load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# split in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=28)
# create and fit a small pipeline with standardization and model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
# save and reload the model
path = '/databricks/driver/test_model.joblib'
joblib.dump(pipe, path)
loaded_model = joblib.load(path)
# sample of unseen data
df = spark.createDataFrame(X_test.sample(50, random_state=42))
# create broadcasted model
sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
Then I used the two methods illustrated above and you will see that the outputs df_prediction will be the same in both cases.
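For a quick look at the result (just a usage sketch; df_prediction and loaded_model are the objects created above):
# peek at a handful of the distributed predictions
df_prediction.select("prediction").show(5)
# optional cross-check against the driver-side model on the same rows
print(loaded_model.predict(df.toPandas())[:5])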

Pre-process y column in a sklearn pipeline before classification

As part of an sklearn pipeline, I'd like to bin my response variable into a variable with k ordinal categories and then do classification on these categories. I found KBinsDiscretizer, which seems to perform this transformation, but it seems to work only on feature columns, not on the target column.
Reproducible example
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.datasets import load_boston
data = load_boston()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']
binarizer_col_y = make_column_transformer(
    (sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal'), ['target']),
    remainder='passthrough'
)
pipeline = Pipeline(steps=[
    ('preprocess', binarizer_col_y),
    ('ols', LinearRegression())
])
pipeline.fit(df[data['feature_names']], df['target'])
This errors with
pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'target'
The above exception was the direct cause of the following exception:
...
[ another key error for 'target']
I also found sklearn.compose.TransformedTargetRegressor to transform the response (but I want to do classification), and I know that I can write my own transformers, but they apparently only modify X, not y.
Can anyone tell me how to modify y in a pre-processing step prior to classification as part of a pipeline?
Why inside the pipeline?
The idea is to move as many transformations into the pipeline as possible, reducing boilerplate code, avoiding data leaks, and simplifying model deployment (e.g. services like the Databricks model registry can deploy an sklearn model with pre-processing expected to happen inside the model).
You get the error because 'target' is not available: the transformation is applied only to X, not to y.
The sklearn Pipeline does not support transforming the target y in the way you tried to write it.
However, there is sklearn.compose.TransformedTargetRegressor, which can wrap a model and be given instructions on how to transform the target.
A warning: it is not well supported, and I ran into many issues when trying to use it on a real project. Maybe you want to do the target transformation manually instead (see the sketch at the end of this answer).
Here is a little demo that might work for you.
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
data = load_boston()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
X = df[data["feature_names"]]
y = data["target"]
pipeline = Pipeline(
    steps=[
        (
            "ols",
            TransformedTargetRegressor(
                LinearRegression(),
                transformer=sklearn.preprocessing.KBinsDiscretizer(
                    n_bins=3, encode="ordinal"
                ),
            ),
        )
    ]
)
pipeline.fit(X, y)
pipeline.predict(X)
Or, a more readable snippet that shows how to create the target transformer:
model = LinearRegression()
kbins = sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode="ordinal")
ttr = TransformedTargetRegressor(model, transformer=kbins)
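And if you prefer the manual route mentioned above, here is a minimal sketch of my own (not from the original answer; LogisticRegression is only a stand-in classifier) that bins y outside the pipeline and then fits a classifier on the ordinal classes:
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LogisticRegression

# bin the target yourself; KBinsDiscretizer expects a 2D array
kbins = KBinsDiscretizer(n_bins=3, encode="ordinal")
y_binned = kbins.fit_transform(np.asarray(y).reshape(-1, 1)).ravel().astype(int)

# any classifier can now be trained on the k ordinal classes
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y_binned)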

output decision tree in the pipeline manner

Hi, as I am new to machine learning methods using the sklearn library, I am trying to incorporate a decision tree into a pipeline and then both make predictions and output the fitted tree, but when I run the following code I get the error:
'Pipeline' object has no attribute 'tree_'
So I wonder whether the pipeline does not support tree output, and how I can fix this problem. I have also tried using the decision tree class directly, but I got another error:
setting an array element with a sequence.
I know this appears because I have vectors of different dimensions, but I still have no clue how to deal with the situation.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import export_text
from sklearn import tree

# a function that reads the corpus, tokenizes it and returns the documents
# and their labels
def read_corpus(corpus_file, use_sentiment):
    documents = []
    labels = []
    with open(corpus_file, encoding='utf-8') as f:
        for line in f:
            tokens = line.strip().split()
            documents.append(tokens[3:])
            if use_sentiment:
                # 2-class problem: positive vs negative
                labels.append(tokens[1])
            else:
                # 6-class problem: books, camera, dvd, health, music, software
                labels.append(tokens[0])
    return documents, labels

# a dummy function that just returns its input
def identity(x):
    return x

# read the data and split it into train and test
X, Y = read_corpus('/Users/dengchenglong/Downloads/trainset', use_sentiment=False)
split_point = int(0.75 * len(X))
Xtrain = X[:split_point]
Ytrain = Y[:split_point]
Xtest = X[split_point:]
Ytest = Y[split_point:]

# let's use the TF-IDF vectorizer
tfidf = False

# we use a dummy function as tokenizer and preprocessor,
# since the texts are already preprocessed and tokenized.
if tfidf:
    vec = TfidfVectorizer(preprocessor=identity,
                          tokenizer=identity)
else:
    vec = CountVectorizer(preprocessor=identity,
                          tokenizer=identity)

# combine the vectorizer with a decision tree classifier
classifier = Pipeline([('vec', vec),
                       ('cls', tree.DecisionTreeClassifier())])

# train the classifier on the train dataset
decision_tree = classifier.fit(Xtrain, Ytrain)

# predict the labels of the test data
Yguess = classifier.predict(Xtest)

tree.plot_tree(classifier.fit(Xtest, Ytest))

# report performance of the classifier
print(accuracy_score(Ytest, Yguess))
print(classification_report(Ytest, Yguess))
What if you try this:
from sklearn.pipeline import make_pipeline
# combine the vectorizer with a decision tree classifier
clf = DecisionTreeClassifier()
classifier = make_pipeline(vec, clf)
As it seems, before using it in the pipeline you must instantiate the model you are trying to apply. Let me know if this works and, if not, what errors it returns.
From: Scikit-learn documentation
Example out of: Make pipeline example with trees
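And, since the original error came from calling the tree utilities on the pipeline object itself, here is a usage sketch (my addition, reusing vec, Xtrain and Ytrain from the question) for getting at the fitted tree afterwards:
classifier.fit(Xtrain, Ytrain)

# the last pipeline step is the fitted DecisionTreeClassifier;
# pass that (not the pipeline) to the tree-specific helpers
fitted_tree = classifier.steps[-1][1]
tree.plot_tree(fitted_tree)
print(export_text(fitted_tree))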

python imblearn make_pipeline TypeError: Last step of Pipeline should implement fit

I am trying to implement SMOTE from imblearn inside a Pipeline. My data sets are text data stored in a pandas dataframe. Please see the code snippet below:
text_clf =Pipeline([('vect', TfidfVectorizer()),('scale', StandardScaler(with_mean=False)),('smt', SMOTE(random_state=5)),('clf', LinearSVC(class_weight='balanced'))])
After this I am using GridSearchCV.
grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring = 'accuracy')
Where parameters are nothing but tuning parameters mostly for TfidfVectorizer().
I am getting the following error.
All intermediate steps should be transformers and implement fit and transform. 'SMOTE
After this error, I changed the code as follows.
vect = TfidfVectorizer(use_idf=True,smooth_idf = True, max_df = 0.25, sublinear_tf = True, ngram_range=(1,2))
X = vect.fit_transform(X).todense()
Y = vect.fit_transform(Y).todense()
X_Train,X_Test,Y_Train,y_test = train_test_split(X,Y, random_state=0, test_size=0.33, shuffle=True)
text_clf =make_pipeline([('smt', SMOTE(random_state=5)),('scale', StandardScaler(with_mean=False)),('clf', LinearSVC(class_weight='balanced'))])
grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring = 'accuracy')
Where parameters are nothing but the tuning parameter C of the SVC classifier.
This time I am getting the following error:
Last step of Pipeline should implement fit.SMOTE(....) doesn't
What is going on here? Can anyone please help?
imblearn.SMOTE has no transform method. The docs are here.
But all steps except the last in a pipeline should have it, along with fit.
To use SMOTE with an sklearn pipeline you would have to implement a custom transformer calling SMOTE.fit_sample() in its transform method.
Another, easier option is just to use the imblearn pipeline:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline

# This doesn't work with sklearn.pipeline.Pipeline because
# SMOTE doesn't have a .transform() method.
# (It has .fit_sample() or .sample().)
pipe = imbPipeline([
    ...
    ('oversample', SMOTE(random_state=5)),
    ('clf', LinearSVC(class_weight='balanced'))
])
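If you then want to keep the GridSearchCV step from your original code, the imblearn pipeline should drop in the same way. A sketch (my addition; it assumes the elided steps above include the TfidfVectorizer, and that parameters, X_train and y_train are the grid and the raw training texts/labels):
from sklearn.model_selection import GridSearchCV

# imblearn's Pipeline exposes get_params/set_params like sklearn's,
# so it can be passed straight to GridSearchCV; SMOTE is then applied
# only to the training part of each CV fold
grid = GridSearchCV(pipe, parameters, cv=4, n_jobs=-1, scoring='accuracy')
grid.fit(X_train, y_train)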

How do I make GridSeachCV work with a custom transformer in my pipeline?

If I exclude my custom transformer, GridSearchCV runs fine, but with it, it errors.
Here is a fake dataset:
import pandas
import numpy
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
import sklearn_pandas
from sklearn.preprocessing import MinMaxScaler

df = pandas.DataFrame({"Letter": ["a","b","c","d","a","b","c","d","a","b","c","d","a","b","c","d"],
                       "Number": [1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4],
                       "Label": ["G","G","B","B","G","G","B","B","G","G","B","B","G","G","B","B"]})

class MyTransformer(TransformerMixin):
    def transform(self, x, **transform_args):
        x["Number"] = x["Number"].apply(lambda row: row * 2)
        return x

    def fit(self, x, y=None, **fit_args):
        return self

x_train = df
y_train = x_train.pop("Label")

mapper = DataFrameMapper([
    ("Number", MinMaxScaler()),
    ("Letter", LabelBinarizer()),
])

pipe = Pipeline([
    ("custom", MyTransformer()),
    ("mapper", mapper),
    ("classifier", RandomForestClassifier()),
])

param_grid = {"classifier__min_samples_split": [10, 20], "classifier__n_estimators": [2, 3, 4]}

model_grid = sklearn_pandas.GridSearchCV(pipe, param_grid, verbose=2, scoring="accuracy")
model_grid.fit(x_train, y_train)
and the error is
list indices must be integers, not str
How can I make GridSearchCV work while there is a custom transformer in my pipeline?
I know this answer comes rather late, but I've encountered the same behavior with sklearn and BaseSearchCV derivative classes. The problem actually seems to stem from the _PartitionIterator class in the sklearn cross_validation module, as it makes the assumption that everything emitted from every TransformerMixin class in the pipeline is going to be array-like, and thus it generates slices of indices that are used to index the incoming X args in an array-like manner. Here's the __iter__ method:
def __iter__(self):
    ind = np.arange(self.n)
    for test_index in self._iter_test_masks():
        train_index = np.logical_not(test_index)
        train_index = ind[train_index]
        test_index = ind[test_index]
        yield train_index, test_index
And the BaseSearchCV grid search metaclass calls cross_validation's _fit_and_score, which uses a method called safe_split. Here's the relevant line:
X_subset = [X[idx] for idx in indices]
This will absolutely produce unexpected results if X is a pandas dataframe, which you're emitting from your transform function.
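A tiny illustration of that (my own, not part of the original answer): on an array the indexing selects rows, but on a DataFrame it becomes a column lookup and fails.
import numpy as np
import pandas as pd

X = pd.DataFrame({"Number": [1, 2, 3], "Letter": ["a", "b", "c"]})
indices = [0, 2]

# on a plain array this picks rows 0 and 2...
rows = [np.asarray(X)[idx] for idx in indices]

# ...but on a DataFrame, X[idx] looks up a *column* named 0, so it breaks
try:
    rows = [X[idx] for idx in indices]
except KeyError as exc:
    print("column lookup failed:", exc)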
There are two ways I've found to fix this:
1. Make sure to return an array from your transformer:
return x.as_matrix()
2. This is a hack. If the pipe of transformers demands the input to the next transformer be a DataFrame, as was my case, you can write a utilities script that is essentially the same as the sklearn grid_search module, but includes some clever validation methods that are called in the _fit method of the BaseSearchCV class:
def _validate_X(X):
    """Returns X if X isn't a pandas frame, otherwise
    the underlying matrix in the frame."""
    return X if not isinstance(X, pd.DataFrame) else X.as_matrix()

def _validate_y(y):
    """Returns y if y isn't a series, otherwise the array"""
    if y is None:
        return y
    # if it's a series
    elif isinstance(y, pd.Series):
        return np.array(y.tolist())
    # if it's a dataframe:
    elif isinstance(y, pd.DataFrame):
        # check its X dims
        if y.shape[1] > 1:
            raise ValueError('matrix provided as y')
        return y[y.columns[0]].tolist()
    # bail and let the sklearn function handle validation
    return y
As an example, here's my "custom grid_search module".
Short version: pandas and scikit-learn's cross-validation methods did not play well together in that way (in my version, 0.15); this may be fixed simply by updating scikit-learn to 0.16/stable or 0.17/dev.
The GridSearchCV class validates the data and converts it to an array (so that it can perform CV splits correctly). So you don't get to use Pandas DataFrame features inside of built-in cross validation loops.
You will have to make your own cross-validation routines that don't do the validation if you want to do this kind of thing.
EDIT: This is my experience with scikit-learn's cross-validation routines. It is why sklearn-pandas provides cross_val_score. However, so far as I can tell, GridSearchCV is not specialized by sklearn-pandas; your import of it accidentally imports the default sklearn version. Therefore, you may have to implement your own grid search using ParameterGrid and sklearn-pandas's cross_val_score.
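A rough sketch of that manual search (my addition; it assumes the pipe, param_grid, x_train and y_train defined in the question):
from sklearn.grid_search import ParameterGrid  # sklearn.model_selection in newer releases
from sklearn_pandas import cross_val_score

best_score, best_params = None, None
for params in ParameterGrid(param_grid):
    pipe.set_params(**params)
    # sklearn-pandas' cross_val_score keeps x_train as a DataFrame while splitting
    scores = cross_val_score(pipe, x_train, y_train, scoring="accuracy")
    if best_score is None or scores.mean() > best_score:
        best_score, best_params = scores.mean(), params
print(best_score, best_params)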
