Sklearn Pipelines: Value Error - Expected number of features - python

I created a pipeline that basically loops over models and scalers and performs recursive feature elimination (RFE) as follows:
def train_models(models, scalers, X_train, y_train, X_val, y_val):
    best_results = {'f1_score': 0}
    for model in models:
        for scaler in scalers:
            for n_features in list(range(
                len(X_train.columns),
                int(len(X_train.columns)/2),
                -10
            )):
                rfe = RFE(
                    estimator=model,
                    n_features_to_select=n_features,
                    step=10
                )
                pipe = Pipeline([
                    ('scaler', scaler),
                    ('selector', rfe),
                    ('model', model)
                ])
                pipe.fit(X_train, y_train)
                y_pred = pipe.predict(X_val)
                results = evaluate(y_val, y_pred)  # Returns a dictionary of values
                results['pipeline'] = pipe
                results['y_pred'] = y_pred
                if results['f1_score'] > best_results['f1_score']:
                    best_results = results
                    print("Best F1: {}".format(best_results['f1_score']))
    return best_results
The pipeline works fine inside the function and is able to predict and score the results properly.
However, when I call pipeline.predict() outside the function, e.g.
best_result = train_models(models, scalers, X_train, y_train, X_val, y_val)
pipeline = best_result['pipeline']
pipeline.predict(X_val)
I get the following error:
Here is what pipeline looks like:
Pipeline(steps=[('scaler', StandardScaler()),
                ('selector',
                 RFE(estimator=LogisticRegression(C=1, max_iter=1000,
                                                  penalty='l1',
                                                  solver='liblinear'),
                     n_features_to_select=78, step=10)),
                ('model',
                 LogisticRegression(C=1, max_iter=1000, penalty='l1',
                                    solver='liblinear'))])
I'm guessing the model in the pipeline is expecting 48 features instead of 78, but I don't understand where the number 48 is coming from since n_features_to_select is set to 78 in the previous RFE step!
Any help would be greatly appreciated!

I do not have your data, but doing some math and guessing from the info you have shared, 48 is the last n_features value that your nested loop tries. This makes me suspect that the culprit is object sharing: the same model instance is reused and refit in every iteration, so the pipeline you stored as the best one ends up holding a model that was last fit on 48 features. I suggest you change the following:
pipe = Pipeline([
    ('scaler', scaler),
    ('selector', rfe),
    ('model', model)
])
to
pipe = Pipeline([
    ('scaler', scaler),
    ('selector', rfe),
    ('model', copy.deepcopy(model))
])
and try again (after first doing an import copy too, of course).
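If you prefer scikit-learn's own utility over copy.deepcopy, sklearn.base.clone returns a fresh, unfitted estimator with the same hyperparameters. A minimal sketch of the loop body using it (my suggestion, not part of the original answer):
from sklearn.base import clone

# Each iteration gets its own unfitted copies, so the pipeline stored in
# best_results is never refit by later iterations.
rfe = RFE(estimator=clone(model), n_features_to_select=n_features, step=10)
pipe = Pipeline([
    ('scaler', clone(scaler)),
    ('selector', rfe),
    ('model', clone(model))
])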

Related

AttributeError: 'ColumnTransformer' object has no attribute 'get_feature_names_out'

I have this:
Preprocessing
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)
num = ['hrs', 'absences', 'JobInvolvement', 'PerformanceRating', 'EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'Age', 'DistanceFromHome', 'Education', 'EducationField', 'JobLevel', 'JobRole', 'MonthlyIncome', 'NumCompaniesWorked', 'PercentSalaryHike', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), ("OE", OrdinalEncoder())]  # DROP IF BINARY?
)
cat = ['BusinessTravel', 'Department', 'Gender', 'MaritalStatus']
preprocessor = ColumnTransformer(transformers=[
    ("numericals", numeric_transformer, num),
    ("categoricals", categorical_transformer, cat)], remainder='passthrough')
Function to simplify
def mod(a, b):
    model = Pipeline(
        steps=[("preprocessing", preprocessor), ("select", a), ("clf", b)])
    return model
Starting to create the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=100431219)
clf = mod(SelectKBest(chi2), RandomForestClassifier())  # preprocessing, select, clf
param_grid = {'preprocessing__numericals__imputer__strategy': ['mean'],
              'preprocessing__numericals__scaler': [MinMaxScaler()],
              'preprocessing__categoricals__imputer__strategy': ['most_frequent'],
              'select__k': list(range(1, 14))}
inner = KFold(n_splits=7, shuffle=True, random_state=100431219)
clf = GridSearchCV(clf,
                   param_grid,
                   scoring='accuracy',
                   cv=inner,
                   n_jobs=4, verbose=1,
                   )
np.random.seed(100431219)
clf.fit(X_train, y_train)
And here I got the error:
trained_pipeline = clf.best_estimator_
print(f"Features selected: {trained_pipeline.named_steps['select'].get_support()}")
print(f"Locations where features selected: {np.where(trained_pipeline.named_steps['select'].get_support())}")
# Feature names before selection (i.e. after preprocessing)
feature_names_before_selection = trained_pipeline.named_steps['preprocessing'].get_feature_names_out() # In this line is the error
print(f"In Scikit-learn 1.x, we can even get the feature names after selection: {trained_pipeline.named_steps['select'].get_feature_names_out(feature_names_before_selection)}")
I obtained the number of features and their positions, but not their names. I want the names.
As you are using scikit-learn version 0.24, you should be referring to the docs for that specific version.
Have a look at this link https://scikit-learn.org/0.24/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=get_feature_names#sklearn.compose.ColumnTransformer.get_feature_names
It says that there is a method ColumnTransformer.get_feature_names; there is no get_feature_names_out in that version.
For more information about this change you can have a look here https://github.com/scikit-learn/scikit-learn/pull/18444
The likely fix to your issue is using this code:
feature_names_before_selection = trained_pipeline.named_steps['preprocessing'].get_feature_names()
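If the same code has to run on both older and newer scikit-learn releases, a small fallback (my own suggestion, not part of the original answer) is to check which method exists:
preproc = trained_pipeline.named_steps['preprocessing']
if hasattr(preproc, 'get_feature_names_out'):   # scikit-learn >= 1.0
    feature_names_before_selection = preproc.get_feature_names_out()
else:                                           # older releases such as 0.24
    feature_names_before_selection = preproc.get_feature_names()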

Scikit Learn Pipeline with SMOTE

I would like to create a Pipeline with SMOTE() inside, but I can't figure out where to implement it.
My target value is imbalanced. Without SMOTE I have very bad results.
My code:
df_n = df[['user_id', 'signup_day', 'signup_month', 'signup_year',
           'purchase_day', 'purchase_month', 'purchase_year', 'purchase_value',
           'source', 'browser', 'sex', 'age', 'is_fraud']]
# Definition of X and y:
X = df_n.drop(['is_fraud'], axis=1)
y = df_n.is_fraud
# split into 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(Counter(y_train))  # Counter({0: 95844, 1: 9934})
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
    ,('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant'))
    ,('encoder', OrdinalEncoder())
])
numeric_features = ['user_id', 'signup_day', 'signup_month', 'signup_year',
                    'purchase_day', 'purchase_month', 'purchase_year', 'purchase_value', 'age']
categorical_features = ['source', 'browser', 'sex']
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features)
        ,('categorical', categorical_transformer, categorical_features)
    ])
regressors = [
    RandomForestRegressor()
    ,LogisticRegression()
    ,DecisionTreeClassifier()
    ,KNeighborsClassifier()
    ,LinearSVC(random_state=42)]
for regressor in regressors:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor)
        ,('regressor', regressor)
    ])
    model = pipeline.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(regressor)
    print(r2_score(y_test, predictions))
My results:
RandomForestRegressor()
0.48925960579049166
LogisticRegression()
0.24151543370722806
DecisionTreeClassifier()
-0.14622417739659155
KNeighborsClassifier()
0.3542030752350408
LinearSVC(random_state=42)
-0.10256098450762474
from imblearn.over_sampling import SMOTEN
sampler = SMOTEN(random_state=0)
Xsm,ysm = sampler.fit_resample(X, y)
You can use the code below to add SMOTE to the pipeline (it needs some tweaking for your case, though):
from imblearn.pipeline import Pipeline
# define pipeline
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, Y, scoring='roc_auc', cv=cv, n_jobs=-1)
Alternatively, treat SMOTE separately, not inside the pipeline, by using the resampling code shown above (fit_resample) before fitting.
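For this question's setup specifically, here is a minimal sketch of the pipeline-based approach (my own adaptation, not quoted from an answer): it reuses the preprocessor, X_train/y_train and imports defined earlier, puts the sampler between preprocessing and the model, and relies on imblearn's Pipeline so resampling only happens during fit; the RandomForestClassifier is just an illustrative choice.
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),        # ColumnTransformer from the question
    ('smote', SMOTE(random_state=42)),     # applied only to the training data during fit
    ('classifier', RandomForestClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))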
What you can do is use a modification of the SMOTE algorithm, called SMOTE-N (see https://imbalanced-learn.org/dev/over_sampling.html#smote-variants), which works when all features are categorical. This modifies the SMOTE algorithm to generate new samples by taking the most common category among a sample's nearest neighbours, instead of interpolating numeric feature values.

Sklearn Pipeline classifier throwing ValueError even when the missing values are taken care of

I have created an sklearn pipeline for preprocessing and then running the model over the processed data. The preprocessing step takes care of missing values, yet it still throws the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Below is my code:
def test_sklearn_pipeline(random_state_num):
    numeric_features = ["x", "y"]
    categorical_features = ["wconfid", "pctid"]
    missing_features = ["x"]
    missing_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="mean"))]
    )
    scale_transformer = Pipeline(
        steps=[("scaler", StandardScaler())]
    )
    categorical_transformer = Pipeline(
        steps=[('ohe', OneHotEncoder(handle_unknown="ignore"))]
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ("miss", missing_transformer, missing_features),
            ("cat", categorical_transformer, categorical_features),
            ('outlier_remover', outlier_removal, numeric_features),
            ("num", scale_transformer, numeric_features)
        ], remainder='passthrough'
    )
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", LinearRegression())]
    )
    df = pd.read_csv('accelerometer_modified.csv')
    df = df.drop(columns=['random'])
    X, y = df.drop(columns=['z']), df.loc[:, 'z']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state_num)
    clf.fit(X_train, y_train)
    print("MSE: %.3f" % mean_squared_error(clf.predict(X_test), y_test))
Numeric features and missing features have the column x in common. ColumnTransformer runs each transformation on the input dataframe independently, which means you are running the StandardScaler on the raw column and not on the imputed one. You probably need the two transformers to run sequentially, or rather put them in a small Pipeline (as you've done already) whose steps first impute and then scale.
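A minimal sketch of that idea (my own illustration, reusing the column names from the question): impute and scale x and y inside one sequential Pipeline so StandardScaler never sees the raw NaNs.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Impute first, then scale, within a single branch of the ColumnTransformer
impute_then_scale = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")),
           ("scaler", StandardScaler())]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", impute_then_scale, ["x", "y"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["wconfid", "pctid"]),
    ],
    remainder="passthrough",
)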

sklearn.exceptions.NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator

I want to draw a decision tree, but my data is text data, so I used a Pipeline. However, I get the same error as in the title. Please tell me how I can plot a tree from my data using graphviz or plot_tree.
data_files = 'dataset2-Komoran.xlsx'
data = pd.read_excel(data_files)
train_data = data[['title', 'category', 'processed_title']]
categories = train_data['category']
labels = list(set(categories))
n_classes = len(labels)
print('possible categories', labels)
for l in labels:
    print('number of ', l, len(train_data.loc[train_data['category'] == l]))
X_train, X_test, y_train, y_test = train_test_split(train_data['processed_title'], train_data['category'], test_size=0.2, random_state=57)
model = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', DecisionTreeClassifier()),
                  ])
model.fit(X_train, y_train)
export_graphviz(model,
                out_file='tree.dot'
                )
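A likely cause, reading the code rather than anything stated in the thread: export_graphviz receives the whole Pipeline, while it expects a fitted decision tree estimator. A minimal sketch of pulling the fitted tree step out of the pipeline (feature and class names are optional extras):
from sklearn.tree import export_graphviz

tree_clf = model.named_steps['clf']  # the fitted DecisionTreeClassifier inside the pipeline
# On scikit-learn < 1.0, use get_feature_names() instead of get_feature_names_out()
feature_names = model.named_steps['vect'].get_feature_names_out()
export_graphviz(
    tree_clf,
    out_file='tree.dot',
    feature_names=feature_names,
    class_names=[str(c) for c in tree_clf.classes_],
)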

Python Scikit-Learn: Custom Analyzer for TfidfVectorizer

So I am trying to understand how to write a custom analyzer for Python scikit-learn's TfidfVectorizer.
I am working on the following Kaggle competition
https://www.kaggle.com/c/whats-cooking
As a first step, I do some cleanup on the ingredients column:
traindf = pd.read_json('../../data/train.json')
traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]
After that I create a pipeline using TfidfVectorizer and the LogisticRegression classifier:
pip = Pipeline([
    ('vect', TfidfVectorizer(
        stop_words='english',
        sublinear_tf=True,
        use_idf=bestParameters['vect__use_idf'],
        max_df=bestParameters['vect__max_df'],
        ngram_range=bestParameters['vect__ngram_range']
    )),
    ('clf', LogisticRegression(C=bestParameters['clf__C']))
])
Then I fit my training set and finally I predict:
X, y = traindf['ingredients_string'], traindf['cuisine'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
parameters = {}
grid_searchTS = GridSearchCV(pip,parameters,n_jobs=3, verbose=1, scoring='accuracy')
grid_searchTS.fit(X_train, y_train)
predictions = grid_searchTS.predict(X_test)
Lastly, I check how my classifier did:
print ('Accuracy:', accuracy_score(y_test, predictions))
print ('Confusion Matrix:', confusion_matrix(y_test, predictions))
print ('Classification Report:', classification_report(y_test, predictions))
Now this gives me around 78% accuracy, fine. Now I basically perform the same steps but with one change: instead of creating a new column in the dataframe for a cleaned-up version of the ingredients, I want to create a custom analyzer that will do the same thing. So I write:
def customAnalyzer(text):
    lemTxt = ["".join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', ingred)) for ingred in lines.lower()]) for lines in sorted(text)]
    return " ".join(lemTxt).strip()
And of course I change the pipeline to:
pip = Pipeline([
    ('vect', TfidfVectorizer(
        stop_words='english',
        sublinear_tf=True,
        use_idf=bestParameters['vect__use_idf'],
        max_df=bestParameters['vect__max_df'],
        ngram_range=bestParameters['vect__ngram_range'],
        analyzer=customAnalyzer
    )),
    ('clf', LogisticRegression(C=bestParameters['clf__C']))
])
Lastly, since I think my customAnalyzer will take care of everything, I create my train/test split as:
X, y = traindf['ingredients'], traindf['cuisine'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
But to my surprise, my accuracy drops to 24%!
Is my intuition of using the custom analyzer in this way correct?
Do I also need to implement a custom tokenizer?
My intention is to use each ingredient as an independent entity. I do not want to deal with words. When I create my n-grams, I want them to be made out of each individual ingredient instead of each word.
How would I achieve this?
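One likely explanation (my reading, not an answer quoted from the thread): a callable analyzer is expected to return the sequence of tokens for a single document, but customAnalyzer returns one joined string, so the vectorizer ends up iterating over it character by character. A sketch of an analyzer that emits one token per lemmatized ingredient, under that assumption (ingredient_analyzer is a hypothetical name):
import re
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()

def ingredient_analyzer(ingredients):
    # One document is the raw list of ingredient strings for a recipe.
    # Returning a list keeps each ingredient as a single token, so features
    # are built over whole ingredients rather than individual words.
    tokens = []
    for ingredient in ingredients:
        cleaned = re.sub('[^A-Za-z]', ' ', ingredient).lower()
        tokens.append(' '.join(lemmatizer.lemmatize(w) for w in cleaned.split()))
    return tokens

# Note: with a callable analyzer, options like stop_words and ngram_range are ignored.
vect = TfidfVectorizer(analyzer=ingredient_analyzer, sublinear_tf=True)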
