I have been learning about sklearn preprocessing and pipelines and have come across FunctionTransformer. I want to understand how, when it is integrated in a pipeline, one passes arguments to the function that the FunctionTransformer wraps. Consider the example below; for simplicity, I have written a small function:
def return_selected_dataset(dataset, columns):
    return dataset[columns]

pipe = Pipeline([('Return_Col', FunctionTransformer(return_selected_dataset))])
pipe.fit_transform(dataset, columns=['Col1', 'Col2'])
I am getting the following error:
ValueError: Pipeline.fit does not accept the columns parameter. You can pass parameters to specific steps of your pipeline using the stepname__parameter format, e.g. `Pipeline.fit(X, y, logisticregression__sample_weight=sample_weight)`.
How can I pass the value of columns to the function?
Also, can someone suggest a book or website where I can study sklearn pipelines and preprocessing in detail, including how to customize these steps?
Example dataset:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
X = pd.DataFrame({'Col1':[1,2],'Col2':[3,4],'Col3':[5,6]})
Your function:
def return_selected_dataset(dataset, columns):
    return dataset[columns]
Without the pipeline, it would look like this:
FunctionTransformer(return_selected_dataset,
                    kw_args={'columns': ['Col1', 'Col2']}).transform(X)
Note that with a pipeline you can only pass parameters to the fit method of each step; see the help page for Pipeline.fit:
**fit_params : dict of string -> object
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
So I think what you can do is:
pipe = Pipeline([
    ('Return_Col',
     FunctionTransformer(return_selected_dataset,
                         kw_args={'columns': ['Col1', 'Col2']}))
])
pipe.fit_transform(X)
   Col1  Col2
0     1     3
1     2     4
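If you need to change the selected columns later without rebuilding the pipeline, note that kw_args is itself a constructor parameter of FunctionTransformer, so you should be able to update it through the pipeline with the step__parameter syntax. A minimal sketch, reusing the step name 'Return_Col' and the X defined above:
pipe.set_params(Return_Col__kw_args={'columns': ['Col1', 'Col3']})
pipe.fit_transform(X)  # now returns Col1 and Col3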
I am working on a binary classification problem using the random forest algorithm. Currently, I am trying to explain the model predictions using SHAP values. So, I referred to this useful post here and tried the below.
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
sv = explainer(ord_test_t)  # explainer is the fitted TreeExplainer (not shown in the snippet)
exp = Explanation(sv.values[:, :, 1],
                  sv.base_values[:, 1],
                  data=ord_test_t.values,
                  feature_names=ord_test_t.columns)
idx = 20
waterfall(exp[idx])
I like the above approach, as it allows the feature values to be displayed along with the waterfall plot, so I wish to use it.
However, this doesn't help me get the waterfall plot for a specific row of ord_test_t (the test data).
For example, let's say that ord_test_t.index.tolist() returns 3, 5, 8, 9, etc.
Now, I want to plot the waterfall plot for ord_test_t.iloc[[9]], but when I pass exp[9], it just gets the 9th positional row, not the row whose index label is 9.
When I try exp.iloc[[9]] it throws an error, as the Explanation object doesn't have iloc.
Can someone help me with this, please?
My suggestion is the following:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
idx = 9
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X.loc[[idx]])  # corrected: pass the row of interest as a DataFrame
exp = Explanation(
    sv.values[:, :, 1],  # class to explain
    sv.base_values[:, 1],
    data=X.loc[[idx]].values,  # corrected: pass the row of interest as a DataFrame
    feature_names=X.columns,
)
waterfall(exp[0])  # pretend you have only 1 data point, which is the 0th
The print(shap.__version__) above outputs 0.40.0.
Proof:
model.predict_proba(X.loc[[idx]]) # corrected
array([[0.95752656, 0.04247344]])
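If you would rather keep the whole test set in the Explanation and still plot a row by its DataFrame index label (e.g. the row labelled 9 in ord_test_t), one option is to translate the label into a positional index first. A sketch, assuming exp was built from all of ord_test_t as in the question:
pos = ord_test_t.index.get_loc(9)  # positional index of the row whose label is 9
waterfall(exp[pos])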
I'm running an LGBM classifier model and I'm able to use lgbm.plot_importance to plot the most important features, but I would prefer to have a list of these features instead. Does anybody know how to go about doing this?
The lightgbm.Booster object has a method .feature_importance() which can be used to access feature importances.
That method returns an array with one importance value per feature, and supports two types of importance, based on the value of importance_type:
"gain" = "cumulative gain of all splits using this feature"
"split" = "number of splits this feature was used in"
You can explore this using the following code. I ran this with lightgbm==3.3.0, numpy==1.21.0, pandas==1.2.3, and scikit-learn==0.24.1, using Python 3.8.
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data = lgb.Dataset(X, label=y)
# train model
bst = lgb.train(
    params={"objective": "binary"},
    train_set=data,
    num_boost_round=10
)
# compute importances
importance_df = (
    pd.DataFrame({
        'feature_name': bst.feature_name(),
        'importance_gain': bst.feature_importance(importance_type='gain'),
        'importance_split': bst.feature_importance(importance_type='split'),
    })
    .sort_values('importance_gain', ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
Here's an example of the output.
  feature_name  importance_gain  importance_split
0    Column_22      1051.204456                 8
1    Column_23       862.363854                10
2    Column_27       262.272097                19
3     Column_7       161.842017                13
4    Column_21        66.431762                24
This is saying that, for example, feature Column_21 was used in more splits than the other top features, but the improvement those splits provided was much less impactful than that of the 8 splits using Column_22.
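Since the original goal was a plain list of the most important features rather than a plot, you can pull it straight out of that DataFrame, e.g. the top 10 names by gain:
top_features = importance_df['feature_name'].head(10).tolist()
print(top_features)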
It seems like you are using the scikit-learn API for LightGBM. This should help.
General idea:
LGBMClassifier.feature_importances_
Particular case:
model_name.feature_importances_
Full code snippet (assuming pandas dataframe was used for training):
features = train_x.columns
importances = model.feature_importances_
feature_importance = (
    pd.DataFrame({'importance': importances, 'features': features})
    .sort_values('importance', ascending=False)
    .reset_index(drop=True)
)
feature_importance
You can also plot the importances:
lgb.plot_importance(model_name)
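Keep in mind that feature_importances_ uses split counts by default. If you want gain-based values with the scikit-learn API, one option (a sketch, assuming model_name is the fitted LGBMClassifier from above and pandas is imported as pd) is to go through the underlying booster:
gain = model_name.booster_.feature_importance(importance_type='gain')
feature_importance_gain = (
    pd.DataFrame({'features': model_name.booster_.feature_name(), 'importance_gain': gain})
    .sort_values('importance_gain', ascending=False)
    .reset_index(drop=True)
)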
I am working on a toy dataset with ColumnTransformer and Pipeline, but I came across an error for which I couldn't find a solution on the internet.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (KBinsDiscretizer, LabelEncoder, MinMaxScaler,
                                   OneHotEncoder, OrdinalEncoder, RobustScaler)

toy = pd.read_csv('toy_dataset.csv')
toy_drop = toy.drop(['Number', 'Illness'], axis=1)
toy_target = toy.Illness
toy_target = toy_target.to_frame()
Data is imported:
rb=RobustScaler()
normalization=MinMaxScaler()
ohe=OneHotEncoder(sparse=False)
le=LabelEncoder()
oe=OrdinalEncoder()
bins = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='uniform')
ct_features = ColumnTransformer([('normalization', normalization, ['Income']),
                                 ('ohe', ohe, ['City', 'Gender', 'Illness']),
                                 ('bins', bins, ['Age']),
                                 ], remainder='drop')

pip = Pipeline([
    ("ct", ct_features),
    # ("collabel", ct_label),
    ('lr', LinearRegression())])
x_train,x_test,y_train,y_test=train_test_split(toy_drop,toy_target, test_size=0.2,random_state=2021)
pip.fit(x_train,y_train)
I think everything looks clear, but this error occurred:
ValueError: A given column is not a column of the dataframe
Instead of remainder='drop' in the ColumnTransformer, write remainder='passthrough'.
As you can see in the sklearn documentation, by default only the columns specified in transformers are transformed and combined in the output, and the non-specified columns are dropped (the default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through.
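Separately, it is worth checking the column list itself: 'Illness' is referenced in the 'ohe' step of the ColumnTransformer but was dropped from toy_drop, so it is not a column of x_train, which is what this ValueError complains about. A sketch of the transformer without it:
ct_features = ColumnTransformer([('normalization', normalization, ['Income']),
                                 ('ohe', ohe, ['City', 'Gender']),  # 'Illness' is the target, not a feature
                                 ('bins', bins, ['Age']),
                                 ], remainder='passthrough')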
New to python and sklearn, so apologies in advance. I have two transformers and I would like to gather the results in a FeatureUnion (for a final modelling step at the end). This should be quite simple, but FeatureUnion is stacking the outputs rather than providing an n x 2 array or DataFrame. In the example below I will generate some data that is 10 rows by 2 columns. This will then generate two features that are each 10 rows by 1 column. I would like the final feature union to have 10 rows and 2 columns, but what I get is 20 rows by 1 column.
I will try to demonstrate with my example below:
some imports
import numpy as np
import pandas as pd
from sklearn import pipeline
from sklearn.base import TransformerMixin
some random data
df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])
a custom transformer that selects a column
class Trans(TransformerMixin):
    def __init__(self, col_name):
        self.col_name = col_name

    def fit(self, X):
        return self

    def transform(self, X):
        return X[self.col_name]
a pipeline that uses the transformer twice (in my real case I have two different transformers but this reproduces the problem)
pipe = pipeline.FeatureUnion([
    ('select_a', Trans('a')),
    ('select_b', Trans('b'))
])
now I use the pipeline, but it returns an array of twice the length
pipe.fit_transform(df).shape
(20,)
however I would like an array with dimensions (10, 2).
Quick fix?
The transformers in the FeatureUnion need to return 2-dimensional matrices; however, in your code, by selecting a single column you are returning a 1-dimensional vector. You could fix this by selecting the column with X[[self.col_name]].
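A minimal sketch of that fix, reusing the Trans class from the question so that each step returns a one-column DataFrame:
class Trans(TransformerMixin):
    def __init__(self, col_name):
        self.col_name = col_name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.col_name]]  # double brackets -> shape (n_samples, 1)

pipe = pipeline.FeatureUnion([('select_a', Trans('a')), ('select_b', Trans('b'))])
pipe.fit_transform(df).shape  # (10, 2)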
My question is: I have many columns in my pandas data frame, and I am trying to apply sklearn preprocessing using DataFrameMapper from the sklearn-pandas library, such as:
mapper = DataFrameMapper([
    ('gender', sklearn.preprocessing.LabelBinarizer()),
    ('gradelevel', sklearn.preprocessing.LabelEncoder()),
    ('subject', sklearn.preprocessing.LabelEncoder()),
    ('districtid', sklearn.preprocessing.LabelEncoder()),
    ('sbmRate', sklearn.preprocessing.StandardScaler()),
    ('pRate', sklearn.preprocessing.StandardScaler()),
    ('assn1', sklearn.preprocessing.StandardScaler()),
    ('assn2', sklearn.preprocessing.StandardScaler()),
    ('assn3', sklearn.preprocessing.StandardScaler()),
    ('assn4', sklearn.preprocessing.StandardScaler()),
    ('assn5', sklearn.preprocessing.StandardScaler()),
    ('attd1', sklearn.preprocessing.StandardScaler()),
    ('attd2', sklearn.preprocessing.StandardScaler()),
    ('attd3', sklearn.preprocessing.StandardScaler()),
    ('attd4', sklearn.preprocessing.StandardScaler()),
    ('attd5', sklearn.preprocessing.StandardScaler()),
    ('sbm1', sklearn.preprocessing.StandardScaler()),
    ('sbm2', sklearn.preprocessing.StandardScaler()),
    ('sbm3', sklearn.preprocessing.StandardScaler()),
    ('sbm4', sklearn.preprocessing.StandardScaler()),
    ('sbm5', sklearn.preprocessing.StandardScaler())
])
I am just wondering whether there is a more succinct way to preprocess many variables at once without writing them all out explicitly.
Another thing I found a little annoying is that when I transform the pandas data frame into arrays that sklearn can work with, the column names are lost, which makes selection very difficult. Does anyone know how to preserve the column names as keys when converting pandas data frames to np arrays?
Thank you so much!
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper
encoders = ['gradelevel', 'subject', 'districtid']
scalars = ['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5']
mapper = DataFrameMapper(
    [('gender', LabelBinarizer())] +
    [(encoder, LabelEncoder()) for encoder in encoders] +
    [(scalar, StandardScaler()) for scalar in scalars]
)
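On the second part of the question (keeping the column names), DataFrameMapper can return a pandas DataFrame instead of a NumPy array if you pass df_out=True (this requires dense, non-sparse output). A sketch, where train_df stands in for your training DataFrame:
mapper = DataFrameMapper(
    [('gender', LabelBinarizer())] +
    [(encoder, LabelEncoder()) for encoder in encoders] +
    [(scalar, StandardScaler()) for scalar in scalars],
    df_out=True
)
transformed = mapper.fit_transform(train_df)  # DataFrame with named columns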
If you're doing this a lot, you could even write your own function:
mapper = data_frame_mapper(binarizers=['gender'],
                           encoders=['gradelevel', 'subject', 'districtid'],
                           scalars=['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5'])
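A minimal sketch of such a data_frame_mapper helper (the name and signature are just the illustrative ones from the call above):
def data_frame_mapper(binarizers=(), encoders=(), scalars=()):
    """Build a DataFrameMapper from lists of column names."""
    return DataFrameMapper(
        [(col, LabelBinarizer()) for col in binarizers] +
        [(col, LabelEncoder()) for col in encoders] +
        [(col, StandardScaler()) for col in scalars]
    )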