AttributeError: 'ColumnTransformer' object has no attribute 'get_feature_names_out'

I have this:
Preprocessing
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)
num = ['hrs', 'absences', 'JobInvolvement', 'PerformanceRating', 'EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'Age', 'DistanceFromHome', 'Education', 'EducationField', 'JobLevel', 'JobRole', 'MonthlyIncome', 'NumCompaniesWorked', 'PercentSalaryHike', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), ("OE", OrdinalEncoder())]  # DROP IF BINARY?
)
cat = ['BusinessTravel', 'Department', 'Gender', 'MaritalStatus']
preprocessor = ColumnTransformer(transformers=[
    ("numericals", numeric_transformer, num),
    ("categoricals", categorical_transformer, cat)], remainder='passthrough')
Function to simplify
def mod(a, b):
    model = Pipeline(
        steps=[("preprocessing", preprocessor), ("select", a), ("clf", b)])
    return model
Starting to create the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=100431219)
clf = mod(SelectKBest(chi2), RandomForestClassifier())  # preprocessing, select, clf
param_grid = {'preprocessing__numericals__imputer__strategy': ['mean'],
              'preprocessing__numericals__scaler': [MinMaxScaler()],
              'preprocessing__categoricals__imputer__strategy': ['most_frequent'],
              'select__k': list(range(1, 14))}
inner = KFold(n_splits=7, shuffle=True, random_state=100431219)
clf = GridSearchCV(clf,
                   param_grid,
                   scoring='accuracy',
                   cv=inner,
                   n_jobs=4, verbose=1)
np.random.seed(100431219)
clf.fit(X_train, y_train)
And here I got the error:
trained_pipeline = clf.best_estimator_
print(f"Features selected: {trained_pipeline.named_steps['select'].get_support()}")
print(f"Locations where features selected: {np.where(trained_pipeline.named_steps['select'].get_support())}")
# Feature names before selection (i.e. after preprocessing)
feature_names_before_selection = trained_pipeline.named_steps['preprocessing'].get_feature_names_out() # In this line is the error
print(f"In Scikit-learn 1.x, we can even get the feature names after selection: {trained_pipeline.named_steps['select'].get_feature_names_out(feature_names_before_selection)}")
I obtained the number of selected features and their positions, but not their names. I want the names.

As you are using scikit-learn version 0.24, you should refer to the docs for that specific version.
Have a look at this link https://scikit-learn.org/0.24/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=get_feature_names#sklearn.compose.ColumnTransformer.get_feature_names
It says that there is a method `ColumnTransformer.get_feature_names`; the method `get_feature_names_out` does not exist in that version.
For more information about this change you can have a look at https://github.com/scikit-learn/scikit-learn/pull/18444
The likely fix for your issue is to use this code:
feature_names_before_selection = trained_pipeline.named_steps['preprocessing'].get_feature_names()
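For what it's worth, `get_feature_names` was later renamed to `get_feature_names_out` (scikit-learn 1.0, the change discussed above). A minimal sketch that works on either side of the rename, assuming the `trained_pipeline` from the question:
preproc = trained_pipeline.named_steps['preprocessing']
if hasattr(preproc, 'get_feature_names_out'):
    # scikit-learn >= 1.0
    feature_names_before_selection = preproc.get_feature_names_out()
else:
    # scikit-learn 0.24 and earlier
    feature_names_before_selection = preproc.get_feature_names()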

Related

Sequential Feature Selection - UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples

I'm trying to implement Sequential Feature Selection and it gives me this warning.
Here's my code.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X = df.copy()
y = X.pop("stroke")
Xtrain, X_val, ytrain, y_val = train_test_split(
    X, y, random_state=1, test_size=0.2, shuffle=True
)
categorical_cols = X.select_dtypes("object").columns.tolist()
numerical_cols = X.select_dtypes("float64").columns.tolist()
numerical_transformer = MinMaxScaler()
categorical_transformer = TargetEncoder()
preprocessor = ColumnTransformer(
    remainder="passthrough",
    transformers=[
        ("num", numerical_transformer, numerical_cols),
        ("cat", categorical_transformer, categorical_cols),
    ],
)
Xtrain_pre = preprocessor.fit_transform(Xtrain, ytrain)
sfs = SFS(LogisticRegression(), k_features="best", scoring="precision")
sfs.fit(Xtrain_pre, ytrain)
The output I'm getting is the warning from the title: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
I know there are some similar questions, but I did not see anything using SFS. Can someone help?
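Not from the original thread, but one common cause of this warning is that the classifier never predicts the positive class in some cross-validation folds, so precision is 0/0. A hedged workaround is to pass a scorer with zero_division set explicitly (a sketch, assuming the variables above):
from sklearn.metrics import make_scorer, precision_score

# Define precision as 0 when there are no positive predictions, instead of
# emitting UndefinedMetricWarning (an assumption about the cause).
precision_scorer = make_scorer(precision_score, zero_division=0)
sfs = SFS(LogisticRegression(), k_features="best", scoring=precision_scorer)
sfs.fit(Xtrain_pre, ytrain)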

Scikit Learn Pipeline with SMOTE

I would like to create a Pipeline with SMOTE() inside, but I can't figure out where to implement it.
My target value is imbalanced. Without SMOTE I have very bad results.
My code:
df_n = df[['user_id', 'signup_day', 'signup_month', 'signup_year',
           'purchase_day', 'purchase_month', 'purchase_year', 'purchase_value',
           'source', 'browser', 'sex', 'age', 'is_fraud']]
# Definition of X and y:
X = df_n.drop(['is_fraud'], axis=1)
y = df_n.is_fraud
# split into 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(Counter(y_train))  # Counter({0: 95844, 1: 9934})
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OrdinalEncoder())
])
numeric_features = ['user_id', 'signup_day', 'signup_month', 'signup_year',
                    'purchase_day', 'purchase_month', 'purchase_year', 'purchase_value', 'age']
categorical_features = ['source', 'browser', 'sex']
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features)
    ])
regressors = [
    RandomForestRegressor(),
    LogisticRegression(),
    DecisionTreeClassifier(),
    KNeighborsClassifier(),
    LinearSVC(random_state=42)]
for regressor in regressors:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', regressor)
    ])
    model = pipeline.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(regressor)
    print(r2_score(y_test, predictions))
My results:
RandomForestRegressor()
0.48925960579049166
LogisticRegression()
0.24151543370722806
DecisionTreeClassifier()
-0.14622417739659155
KNeighborsClassifier()
0.3542030752350408
LinearSVC(random_state=42)
-0.10256098450762474
What you can do is use a modification of the SMOTE algorithm, called SMOTE-N (see https://imbalanced-learn.org/dev/over_sampling.html#smote-variants), which works when all features are categorical: it generates new samples by picking, for each feature, the most common category among a sample's nearest neighbours. Treat SMOTE separately, not inside the pipeline, by using this code:
from imblearn.over_sampling import SMOTEN
sampler = SMOTEN(random_state=0)
Xsm, ysm = sampler.fit_resample(X, y)
Alternatively, you can use the code below for adding SMOTE inside a pipeline (it needs some tweaking, though):
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
# define pipeline
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
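For reference, a minimal sketch (not from the original answers) of where SMOTE could sit inside the question's own pipeline, using imblearn's Pipeline so that resampling happens only during fit and after the preprocessor has made all features numeric; the variable names reuse the question's:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier

pipe = ImbPipeline(steps=[
    ('preprocessor', preprocessor),            # impute/scale/encode first
    ('smote', SMOTE(random_state=42)),         # applied only when fitting, never at predict time
    ('classifier', DecisionTreeClassifier())
])
model = pipe.fit(X_train, y_train)
predictions = model.predict(X_test)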

Sklearn Pipeline classifier throwing ValueError even when the missing values are taken care of

I have created a sklearn pipeline for preprocessing and then running the model over the processed data. The preprocessing step takes care of missing values, yet it still throws the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The below is my code :
def test_sklearn_pipeline(random_state_num):
    numeric_features = ["x", "y"]
    categorical_features = ["wconfid", "pctid"]
    missing_features = ["x"]
    missing_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="mean"))]
    )
    scale_transformer = Pipeline(
        steps=[("scaler", StandardScaler())]
    )
    categorical_transformer = Pipeline(
        steps=[('ohe', OneHotEncoder(handle_unknown="ignore"))]
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ("miss", missing_transformer, missing_features),
            ("cat", categorical_transformer, categorical_features),
            ('outlier_remover', outlier_removal, numeric_features),
            ("num", scale_transformer, numeric_features)
        ], remainder='passthrough'
    )
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", LinearRegression())]
    )
    df = pd.read_csv('accelerometer_modified.csv')
    df = df.drop(columns=['random'])
    X, y = df.drop(columns=['z']), df.loc[:, 'z']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state_num)
    clf.fit(X_train, y_train)
    print("MSE: %.3f" % mean_squared_error(clf.predict(X_test), y_test))
Numeric features and missing features have the column x in common, and ColumnTransformer runs each transformer on the original input dataframe, in parallel. This means you are running the StandardScaler on the raw column, not the imputed one. You need the two transformations to run sequentially; put a small Pipeline, as you've done already, with steps that first impute and then scale.
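A minimal sketch of that sequential arrangement, reusing the question's column names (the custom outlier_remover is omitted here; the point is that each column now appears in exactly one transformer):
numeric_pipeline = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")),   # fill NaNs first...
           ("scaler", StandardScaler())]                  # ...then scale the imputed values
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, ["x", "y"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["wconfid", "pctid"])
    ], remainder='passthrough'
)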

sklearn.exceptions.NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator

I want to draw a decision tree, but my data is text, so I used a Pipeline. However, the error in the title appears. Please tell me how I can plot the tree from my data using graphviz or plot_tree.
data_files = 'dataset2-Komoran.xlsx'
data = pd.read_excel(data_files)
train_data = data[['title', 'category', 'processed_title']]
categories = train_data['category']
labels = list(set(categories))
n_classes = len(labels)
print('possible categories', labels)
for l in labels:
    print('number of', l, len(train_data.loc[train_data['category'] == l]))
X_train, X_test, y_train, y_test = train_test_split(train_data['processed_title'], train_data['category'], test_size=0.2, random_state=57)
model = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', DecisionTreeClassifier()),
                  ])
model.fit(X_train, y_train)
export_graphviz(model,
                out_file='tree.dot'
                )
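Not an answer from the original thread, but a likely fix: export_graphviz expects the fitted DecisionTreeClassifier itself, not the surrounding Pipeline. A sketch that extracts the tree step (the feature-name call assumes scikit-learn >= 1.0; older versions spell it get_feature_names()):
tree = model.named_steps['clf']
export_graphviz(tree,
                out_file='tree.dot',
                feature_names=model.named_steps['vect'].get_feature_names_out(),
                class_names=[str(c) for c in tree.classes_])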

Sklearn Pipelines: Value Error - Expected number of features

I created a pipeline that basically loops over models and scalers and performs recursive feature elimination (RFE) as follows:
def train_models(models, scalers, X_train, y_train, X_val, y_val):
    best_results = {'f1_score': 0}
    for model in models:
        for scaler in scalers:
            for n_features in list(range(
                len(X_train.columns),
                int(len(X_train.columns) / 2),
                -10
            )):
                rfe = RFE(
                    estimator=model,
                    n_features_to_select=n_features,
                    step=10
                )
                pipe = Pipeline([
                    ('scaler', scaler),
                    ('selector', rfe),
                    ('model', model)
                ])
                pipe.fit(X_train, y_train)
                y_pred = pipe.predict(X_val)
                results = evaluate(y_val, y_pred)  # Returns a dictionary of values
                results['pipeline'] = pipe
                results['y_pred'] = y_pred
                if results['f1_score'] > best_results['f1_score']:
                    best_results = results
    print("Best F1: {}".format(best_results['f1_score']))
    return best_results
The pipeline works fine inside the function and is able to predict and score the results properly.
However, when I call pipeline.predict() outside the function, e.g.
best_result = train_models(models, scalers, X_train, y_train, X_val, y_val)
pipeline = best_result['pipeline']
pipeline.predict(X_val)
I get a ValueError saying the input has 78 features but the model is expecting 48.
Here is what pipeline looks like:
Pipeline(steps=[('scaler', StandardScaler()),
('selector',
RFE(estimator=LogisticRegression(C=1, max_iter=1000,
penalty='l1',
solver='liblinear'),
n_features_to_select=78, step=10)),
('model',
LogisticRegression(C=1, max_iter=1000, penalty='l1',
solver='liblinear'))])
I'm guessing the model in the pipeline is expecting 48 features instead of 78, but I don't understand where the number 48 is coming from since n_features_to_select is set to 78 in the previous RFE step!
Any help would be greatly appreciated!
I do not have your data, but doing some math and guessing based on the info you have shared, 48 seems to be the last n_features your nested loop tries. This makes me suspect that the culprit is a shared object: every pipeline stores a reference to the same model instance, so later iterations refit it. I suggest you change the following:
pipe = Pipeline([
    ('scaler', scaler),
    ('selector', rfe),
    ('model', model)
])
to
pipe = Pipeline([
    ('scaler', scaler),
    ('selector', rfe),
    ('model', copy.deepcopy(model))
])
and try again (after first doing an import copy too, of course).
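A follow-up note: sklearn.base.clone achieves the same thing more idiomatically, returning a fresh unfitted estimator with identical parameters, so the best pipeline no longer shares its final step with later iterations (a sketch):
from sklearn.base import clone

pipe = Pipeline([
    ('scaler', scaler),
    ('selector', rfe),
    ('model', clone(model))   # fresh copy; later loop iterations won't refit this one
])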
