Preserve column order after applying sklearn.compose.ColumnTransformer - python

I'm using the Pipeline and ColumnTransformer classes from the sklearn library to perform feature engineering on my dataset.
The dataset initially looks like this:
      date  date_block_num  shop_id  item_id  item_price
02.01.2013               0       59    22154      999.00
03.01.2013               0       25     2552      899.00
05.01.2013               0       25     2552      899.00
06.01.2013               0       25     2554     1709.05
15.01.2013               0       25     2555     1099.00
$> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype
---  ------          -----
 0   date            object
 1   date_block_num  object
 2   shop_id         object
 3   item_id         object
 4   item_price      float64
 5   item_cnt_day    float64
dtypes: float64(2), object(4)
memory usage: 134.4+ MB
Then I have the following transformations:
num_column_transformer = ColumnTransformer(
    transformers=[
        ("std_scaler", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough"
)

num_pipeline = Pipeline(
    steps=[
        ("percent_item_cnt_day_per_shop", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="shop_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_shop")
        ),
        ("percent_item_cnt_day_per_item", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="item_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_item")
        ),
        ("percent_sales_per_shop", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="shop_id",
            new_attribute_name="%_sales_per_shop")
        ),
        ("percent_sales_per_item", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="item_id",
            new_attribute_name="%_sales_per_item")
        ),
        ("num_column_transformer", num_column_transformer),
    ]
)
The first four transformers create four new numeric features, and the last one applies StandardScaler over all the numeric columns of the dataset.
After executing it, I get the following data:
        0          1          2          3          4           5  6   7      8
-0.092652  -0.765612  -0.173122  -0.756606  -0.379775  02.01.2013  0  59  22154
-0.092652   1.557684  -0.175922   1.563224  -0.394319  03.01.2013  0  25   2552
-0.856351   1.557684  -0.175922   1.563224  -0.394319  05.01.2013  0  25   2552
-0.092652   1.557684   -0.17613   1.563224  -0.396646  06.01.2013  0  25   2554
-0.092652   1.557684  -0.173278   1.563224  -0.380647  15.01.2013  0  25   2555
I'd like to have the following output:
      date  date_block_num  shop_id  item_id  item_price  %_item_cnt_day_per_shop  %_item_cnt_day_per_item  %_sales_per_shop  %_sales_per_item
02.01.2013               0       59    22154   -0.092652                -0.765612                -0.173122         -0.756606         -0.379775
03.01.2013               0       25     2552   -0.092652                 1.557684                -0.175922          1.563224         -0.394319
05.01.2013               0       25     2552   -0.856351                 1.557684                -0.175922          1.563224         -0.394319
06.01.2013               0       25     2554   -0.092652                 1.557684                 -0.17613          1.563224         -0.396646
15.01.2013               0       25     2555   -0.092652                 1.557684                -0.173278          1.563224         -0.380647
As you can see, columns 5, 6, 7, and 8 of the output correspond to the first four columns in the original dataset. However, I can't tell which scaled column is which; for example, I don't know where the item_price feature lies in the output table.
How can I preserve the column order and names? After that, I want to do feature engineering on the categorical variables, and my transformers make use of the feature column names.
Am I using the Scikit-Learn API correctly?

There's one point to be aware of when dealing with ColumnTransformer, which is reported within the doc as follows:
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list.
That's the reason why your ColumnTransformer instance messes things up. Indeed, consider this simplified example which resembles your setting:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
df = pd.DataFrame({
    'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'],
    'date_block_num': ['0', '0', '0', '0', '0'],
    'shop_id': ['59', '25', '25', '25', '25'],
    'item_id': ['22514', '2252', '2252', '2254', '2255'],
    'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})

ct = ColumnTransformer([
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))],
    remainder='passthrough')

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())
As you might notice, the first column in the transformed dataframe turns out to be the numeric one, i.e. the one which undergoes the scaling (and the first in the transformers list).
Conversely, here's how you can work around the issue: postpone the scaling of the numeric variables until after passing through all the string variables, which ensures you get the columns in your desired order:
ct = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
])

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())
To complete the picture, here is an attempt to reproduce your Pipeline (though the custom transformer is for sure slightly different from yours):
from sklearn.base import BaseEstimator, TransformerMixin

class PercentOverTotalAttributeWholeAdder(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_percent_name='shop_id', new_attribute_name='%_item_cnt_day_per_shop'):
        self.attribute_percent_name = attribute_percent_name
        self.new_attribute_name = new_attribute_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # share of rows belonging to each group of attribute_percent_name
        X[self.new_attribute_name] = X.groupby(by=self.attribute_percent_name)[self.attribute_percent_name].transform('count') / X.shape[0]
        return X
ct_pipe = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
], verbose_feature_names_out=False)

pipe = Pipeline([
    ('percent_item_cnt_day_per_shop', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='shop_id',
        new_attribute_name='%_item_cnt_day_per_shop')
    ),
    ('percent_item_cnt_day_per_item', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='item_id',
        new_attribute_name='%_item_cnt_day_per_item')
    ),
    ('column_trans', ct_pipe),
])

pd.DataFrame(pipe.fit_transform(df), columns=pipe[-1].get_feature_names_out())
As a final remark, observe that the verbose_feature_names_out=False parameter ensures that the names of the columns of the transformed dataframe do not show prefixes which refer to the different transformers in ColumnTransformer.
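For instance, on the toy df above you can compare the resulting names directly (a quick sketch using the pipe defined in this answer):
pipe.fit_transform(df)
print(pipe[-1].get_feature_names_out())
# with verbose_feature_names_out=False:
#   ['date' 'date_block_num' 'shop_id' 'item_id' 'item_price'
#    '%_item_cnt_day_per_shop' '%_item_cnt_day_per_item']
# with the default (True), the same columns carry transformer prefixes,
# e.g. 'pass__date', ..., 'std_scaler__item_price'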

Answer using scikit-learn 1.2.1
In scikit-learn 1.2 it's possible to set the output of the ColumnTransformer to a pandas DataFrame, avoiding this conversion in a second step. Besides this, in the answer proposed by @amiola the ColumnTransformer makes use of a passthrough step to preserve the order of the string-type columns with respect to the numeric ones, but this only works if all the string-type columns come before the numeric ones. To show this, I use the same example with the shop_id column converted to numeric:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
df = pd.DataFrame({
    'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'],
    'date_block_num': ['0', '0', '0', '0', '0'],
    'shop_id': [59, 25, 25, 25, 25],
    'item_id': ['22514', '2252', '2252', '2254', '2255'],
    'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})

ct = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
]).set_output(transform='pandas')

out_df = ct.fit_transform(df)
out_df
   pass__date  pass__date_block_num  pass__item_id  std_scaler__shop_id  std_scaler__item_price
0  02.01.2013                     0          22514                  2.0               -0.402369
1  03.01.2013                     0           2252                 -0.5               -0.732153
2  05.01.2013                     0           2252                 -0.5               -0.732153
3  06.01.2013                     0           2254                 -0.5                1.939261
4  15.01.2013                     0           2255                 -0.5               -0.072585
As you can see, the shop_id column is moved to the end, for the same reason explained in amiola's answer (i.e. columns are reordered following the order of the transformers in the ColumnTransformer). To overcome this issue, you can reorder the dataframe columns after the transformation, with verbose_feature_names_out set to False to preserve the original column names (beware that those names must be unique, see the docs). There's also no need to create a specific passthrough step.
ct = ColumnTransformer([
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))],
    remainder='passthrough',
    verbose_feature_names_out=False).set_output(transform='pandas')

out_df = ct.fit_transform(df)
out_df = out_df[df.columns]
out_df
         date  date_block_num  shop_id  item_id  item_price
0  02.01.2013               0      2.0    22514   -0.402369
1  03.01.2013               0     -0.5     2252   -0.732153
2  05.01.2013               0     -0.5     2252   -0.732153
3  06.01.2013               0     -0.5     2254    1.939261
4  15.01.2013               0     -0.5     2255   -0.072585

Related

Determine the proportion of missing values for each Cow

I have a large dataframe with over 100 columns. One of the columns is "Cow". For each value of "Cow" I would like to determine the number of missing values in each of the other columns.
Using code from Get proportion of missing values per Country
I am able to tabulate the number of missing values for one column at a time. By repeating the code for each column and then merging the dataframes I am able to build a dataframe that has the proportion of missing values for each cow for each column. The problem is that I have over 100 columns.
The following creates a short data example
import pandas as pd
import numpy as np
mast_model_data = [[1152, '1', '10', '23'], [1154, '1', '4', '43'],
                   [1155, 'NA', '3', '76'], [1152, '1', '10', 'NA'],
                   [1155, '2', '10', '65'], [1152, '1', '4', 'NA']]
df = pd.DataFrame(mast_model_data, columns=['Cow', 'Lact', 'Procedure', 'Height'])
df.loc[:, 'Lact'] = df['Lact'].replace('NA', np.nan)
df.loc[:, 'Procedure'] = df['Procedure'].replace('NA', np.nan)
df.loc[:, 'Height'] = df['Height'].replace('NA', np.nan)
df
The data is presented below
Cow Lact Procedure Height
0 1152 1 10 23
1 1154 1 4 43
2 1155 NaN 3 76
3 1152 1 10 NaN
4 1155 2 10 65
5 1152 1 4 NaN
The code that I am using to tabulate missing data is as follows
df1 = (df.groupby('Cow')['Lact']
       .apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
       .reset_index(name='Lact'))
df2 = (df.groupby('Cow')['Procedure']
       .apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
       .reset_index(name='Procedure'))
df3 = (df.groupby('Cow')['Height']
       .apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
       .reset_index(name='Height'))
missing = df1.merge(df2, on=['Cow'], how="left")
missing = missing.merge(df3, on=['Cow'], how="left")
missing
The output of the code above is
Cow Lact Procedure Height
0 1152 0.0 0.0 0.666667
1 1154 0.0 0.0 0.000000
2 1155 0.5 0.0 0.000000
The actual dataframe has more cows and columns, so completing the table this way would require a lot of repetition.
I anticipate there is a more refined way that does not avoid all this repeated code.
I'd appreciate advice on how I can streamline the code.
Try as follows:
missing = df.set_index('Cow').isna().groupby(level=0).mean()\
            .reset_index(drop=False)
print(missing)
Cow Lact Procedure Height
0 1152 0.0 0.0 0.666667
1 1154 0.0 0.0 0.000000
2 1155 0.5 0.0 0.000000
Explanation
Set column Cow as the index, and apply df.isna to get a mask of bool values with True for NaN values.
Now, chain df.groupby on the index (i.e. level=0), retrieve the mean, and reset the index again.
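Equivalently, you can group the boolean mask by the Cow column itself without setting an index first (a small sketch on the example data above):
# mask of missing values, grouped by the Cow labels
missing = (df.drop(columns='Cow')     # exclude the grouping key itself
             .isna()
             .groupby(df['Cow'])      # group the mask by the Cow values
             .mean()
             .reset_index())
print(missing)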

NaN values introduced into test and train data

I am working on a Data Science project with the FIFA dataset. I cleaned the data and took care of any NaN values to get it ready to be split into test and train sets. I need to use StratifiedShuffleSplit to split the data. I updated to a cleaner way of dividing the value data into groups, but I am still getting NaN values once it goes through the split.
Link to the data set I am using: https://www.kaggle.com/karangadiya/fifa19
n = fifa['value'].count()
folds = 3
fifa.sort_values('value', ascending=False, inplace=True)
fifa['group_id'] = np.floor(np.arange(n)/folds)
fifa['value_cat'] = fifa.groupby('group_id', as_index = False)['name'].transform(lambda x: np.random.choice(v_cats, size=x.size, replace = False))
At this point, when I check the test and train data, I find mystery NaN values have been introduced. I think the NaN values may be a result of .loc, since I am getting a warning in Jupyter:
c:\python37\lib\site-packages\ipykernel_launcher.py:6: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
Code below:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(fifa, fifa['value_cat']):
    strat_train_set = fifa.loc[train_index]
    strat_test_set = fifa.loc[test_index]

fifa = strat_train_set.drop('value', axis=1)
value_labels = strat_train_set['value'].copy()
PLEASE HELP MY POOR SOUL!!
Here's one solution.
import numpy as np
import pandas as pd
n = 100
folds = 3
# Make some data
df = pd.DataFrame({'id':np.arange(n), 'value':np.random.lognormal(mean=10, sigma=1, size=n)})
# Sort by value
df.sort_values('value', ascending=False, inplace=True)
# Insert 'group' ids, 0, 0, 0, 1, 1, 1, 2, 2, 2, ...
df['group_id'] = np.floor(np.arange(n)/folds)
# Randomly assign folds within each group
df['fold'] = df.groupby('group_id', as_index=False)['id'].transform(lambda x: np.random.choice(folds, size=x.size, replace=False))
# Inspect
df.head(10)
id value group_id fold
46 46 208904.679048 0.0 0
3 3 175730.118616 0.0 2
0 0 137067.103600 0.0 1
87 87 101894.243831 1.0 2
11 11 100570.573379 1.0 1
90 90 93681.391254 1.0 0
73 73 92462.150435 2.0 2
13 13 90349.408620 2.0 1
86 86 87568.402021 2.0 0
88 88 82581.010789 3.0 1
Assuming you want k folds, the idea is to sort the data by value, then randomly assign folds 1, 2, ..., k to the first k rows, then do the same to the next k rows, etc.
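To then carve out a test set from these folds, a simple boolean filter on the fold column is enough, which also avoids the positional .loc lookups on a sorted frame that may be what was introducing the NaN rows in the question (a sketch using the df built above, with fold 0 as an arbitrary hold-out):
# hold out one fold as the test set (fold 0 is an arbitrary choice here)
test_fold = 0

strat_train_set = df[df['fold'] != test_fold].copy()
strat_test_set = df[df['fold'] == test_fold].copy()

print(len(strat_train_set), len(strat_test_set))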
By the way, you will have more luck getting answers to questions here if you can create reproducible examples with data that make it easy for others to tinker with. :)

Using SMOTE to oversample a binary class; why is it returning random float values between 0 and 1?

I'm using SMOTE to resample a binary class TARGET_FRAUD, which contains the values 0 and 1. Class 0 has around 900 records, while class 1 only has about 100. I want to oversample class 1 to around 800 records.
This is to perform some classification modeling.
#fix imbalanced data
from imblearn.over_sampling import SMOTE
#bar plot of target_fraud distribution
sns.countplot('TARGET_FRAUD', data=df)
plt.title('Before Resampling')
plt.show()
#Synthetic Minority Over-Sampling Technique
sm = SMOTE()
# Fit the model to generate the data.
oversampled_trainX, oversampled_trainY = sm.fit_resample(df.drop('TARGET_FRAUD', axis=1), df['TARGET_FRAUD'])
resampled_df = pd.concat([pd.DataFrame(oversampled_trainY), pd.DataFrame(oversampled_trainX)], axis=1)
resampled_df.columns = df.columns
sns.countplot('TARGET_FRAUD', data=resampled_df)
plt.title('After Resampling')
plt.show()
This is the count of values before resampling:
TARGET_FRAUD:
0 898
1 102
This is the count of values after resampling:
1.000000 1251
0.000000 439
0.188377 1
0.228350 1
0.577813 1
0.989742 1
0.316744 1
0.791926 1
0.970161 1
0.757886 1
0.089506 1
0.567179 1
0.331502 1
0.563530 1
0.882599 1
0.918105 1
0.613229 1
0.239910 1
0.487373 1
...
Why is it producing random float values between 0 and 1? I only want it to return int values of 0 and 1.
I do not have your dataset, but based on your code I made a reproducible example, and I cannot replicate what you are describing.
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0, weights=[0.9, 0.1])
df = pd.DataFrame(X)
df["TARGET_FRAUD"] = y

print("Before resampling")
print(Counter(df["TARGET_FRAUD"]))

sm = SMOTE()
# Fit the model to generate the data.
oversampled_trainX, oversampled_trainY = sm.fit_resample(
    df.drop("TARGET_FRAUD", axis=1), df["TARGET_FRAUD"]
)
resampled_df = pd.concat(
    [pd.DataFrame(oversampled_trainY), pd.DataFrame(oversampled_trainX)],
    axis=1,
)

print("After resampling")
print(Counter(resampled_df["TARGET_FRAUD"]))
which prints
Before resampling
Counter({0: 90, 1: 10})
After resampling
Counter({0: 90, 1: 90})

OneHotEncoder only a single feature which is string

I want ONLY ONE of my features to be converted into separate binary features:
df["pattern_id"]
Out[202]:
0 3
1 3
...
7440 2
7441 2
7442 3
Name: pattern_id, Length: 7443, dtype: int64
df["pattern_id"]
Out[202]:
0 0 0 1
1 0 0 1
...
7440 0 1 0
7441 0 1 0
7442 0 0 1
Name: pattern_id, Length: 7443, dtype: int64
I want to use OneHotEncoder, data is int, so no need to encode it:
onehotencoder = OneHotEncoder(categorical_features=["pattern_id"])
df = onehotencoder.fit_transform(df).toarray()
ValueError: could not convert string to float: 'http://www.zaragoza.es/sedeelectronica/'
Interestingly enough, I receive an error... sklearn tried to encode another column, not the one I wanted.
We have to encode pattern_id to be an integer value
I used this link: Issue with OneHotEncoder for categorical features
#transform the pattern_id feature to int
encoding_feature = ["pattern_id"]
enc = LabelEncoder()
enc.fit(encoding_feature)
working_feature = enc.transform(encoding_feature)
working_feature = working_feature.reshape(-1, 1)
ohe = OneHotEncoder(sparse=False)
#convert the pattern_id feature to separate binary features
onehotencoder = OneHotEncoder(categorical_features=working_feature, sparse=False)
df = onehotencoder.fit_transform(df).toarray()
And I get the same error. What am I doing wrong ?
Edit
source:
https://github.com/martin-varbanov96/scraper/blob/master/logo_scrape/logo_scrape/analysis.py
df
Out[259]:
found_img is_http link_img \
0 True 0 img/aahoteles.svg
//www.zaragoza.es/cont/paginas/img/sede/logo_e...
pattern_id current_link site_id \
0 3 https://www.aa-hoteles.com/es/reservas 3
6 3 https://www.aa-hoteles.com/es/ofertas-hoteles 3
7 2 http://about.pressreader.com/contact-us/ 4
8 3 http://about.pressreader.com/contact-us/ 4
status link_id
0 200 https://www.aa-hoteles.com/
1 200 https://www.365travel.asia/
2 200 https://www.365travel.asia/
3 200 https://www.365travel.asia/
4 200 https://www.aa-hoteles.com/
5 200 https://www.aa-hoteles.com/
6 200 https://www.aa-hoteles.com/
7 200 http://about.pressreader.com
8 200 http://about.pressreader.com
9 200 https://www.365travel.asia/
10 200 https://www.365travel.asia/
11 200 https://www.365travel.asia/
12 200 https://www.365travel.asia/
13 200 https://www.365travel.asia/
14 200 https://www.365travel.asia/
15 200 https://www.365travel.asia/
16 200 https://www.365travel.asia/
17 200 https://www.365travel.asia/
18 200 http://about.pressreade
[7443 rows x 8 columns]
If you take a look at the documentation for OneHotEncoder you can see that the categorical_features argument expects '“all” or array of indices or mask' not a string. You can make your code work by changing to the following lines
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a dataframe of random ints
df = pd.DataFrame(np.random.randint(0, 4, size=(100, 4)),
                  columns=['pattern_id', 'B', 'C', 'D'])

onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
df = onehotencoder.fit_transform(df)
However df will no longer be a DataFrame, I would suggest working directly with the numpy arrays.
You can also do it like this
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(df.required_column.values.reshape(-1, 1)).toarray()
We need to reshape the column, because fit_transform requires a 2-D array. Then you can add columns to this numpy array and then merge it with your DataFrame.
Seen from this link here
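As a rough sketch of that last merge-back step (assuming a recent scikit-learn where OneHotEncoder has get_feature_names_out, and a hypothetical frame where pattern_id is the column to expand):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical frame: only pattern_id should become binary columns
df = pd.DataFrame({'pattern_id': [3, 3, 2, 2, 3],
                   'current_link': ['a', 'b', 'c', 'd', 'e']})

onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(df['pattern_id'].values.reshape(-1, 1)).toarray()

# name the new binary columns and glue them back onto the original frame
encoded = pd.DataFrame(X,
                       columns=onehotenc.get_feature_names_out(['pattern_id']),
                       index=df.index)
df = pd.concat([df.drop(columns='pattern_id'), encoded], axis=1)
print(df)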
The recommended way to work with different column types is detailed in the sklearn documentation here.
Representative example:
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
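In the linked docs this preprocessor is then chained with an estimator. A minimal sketch of that final step (the classifier choice and the X_train/y_train names are placeholders, not something defined above):
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

# clf.fit(X_train, y_train)   # X_train must contain the columns listed above
# clf.score(X_test, y_test)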

How to do Onehotencoding in Sklearn Pipeline

I am trying to one-hot encode the categorical variables of my Pandas dataframe, which includes both categorical and continuous variables. I realise this can be done easily with the pandas .get_dummies() function, but I need to use a pipeline so I can generate a PMML file later on.
This is the code to create a mapper. The categorical variables I would like to encode are stored in a list called 'dummies'.
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies] +
    [(d, OneHotEncoder()) for d in dummies]
)
And this is the code to create a pipeline, including the mapper and linear regression.
from sklearn2pmml import PMMLPipeline
from sklearn.linear_model import LinearRegression
lm = PMMLPipeline([("mapper", mapper),
("regressor", LinearRegression())])
When I now try to fit (with 'features' being a dataframe, and 'targets' a series), it gives an error 'could not convert string to float'.
lm.fit(features, targets)
OneHotEncoder doesn't support string features, and with [(d, OneHotEncoder()) for d in dummies] you are applying it to all dummies columns. Use LabelBinarizer instead:
mapper = DataFrameMapper(
    [(d, LabelBinarizer()) for d in dummies]
)
An alternative would be to use the LabelEncoder with a second OneHotEncoder step.
mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies]
)
lm = PMMLPipeline([("mapper", mapper),
                   ("onehot", OneHotEncoder()),
                   ("regressor", LinearRegression())])
LabelEncoder and LabelBinarizer are intended for encoding/binarizing the target (label) of your data, i.e. the y vector. They do more or less the same thing as OneHotEncoder; the main difference is that the Label* preprocessing steps don't accept matrices, only 1-D vectors.
example = pd.DataFrame({'x': np.arange(2, 14, 2),
                        'cat1': ['A', 'B', 'A', 'B', 'C', 'A'],
                        'cat2': ['p', 'q', 'w', 'p', 'q', 'w']})
dummies = ['cat1', 'cat2']
x cat1 cat2
0 2 A p
1 4 B q
2 6 A w
3 8 B p
4 10 C q
5 12 A w
As an example, LabelEncoder().fit_transform(example['cat1']) works, but LabelEncoder().fit_transform(example[dummies]) throws a ValueError exception.
In contrast, OneHotEncoder accepts multiple columns:
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder().fit_transform(example[dummies])
<6x6 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
This can be incorporated into a pipeline using a ColumnTransformer, passing through (or alternatively applying different transformations to) the other columns :
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encode_cats', OneHotEncoder(), dummies)],
                       remainder='passthrough')
pd.DataFrame(ct.fit_transform(example), columns = ct.get_feature_names_out())
encode_cats__cat1_A encode_cats__cat1_B ... encode_cats__cat2_w remainder__x
0 1.0 0.0 ... 0.0 2.0
1 0.0 1.0 ... 0.0 4.0
2 1.0 0.0 ... 1.0 6.0
3 0.0 1.0 ... 0.0 8.0
4 0.0 0.0 ... 0.0 10.0
5 1.0 0.0 ... 1.0 12.0
Finally, slot this into a pipeline:
from sklearn.pipeline import Pipeline
Pipeline([('preprocessing', ct),
          ('regressor', LinearRegression())])
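For completeness, a small usage sketch on the toy example frame above (the target y here is invented purely for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression

# fabricate a toy target just to exercise the pipeline end to end
y = example['x'] * 1.5 + np.random.normal(size=len(example))

pipe = Pipeline([('preprocessing', ct),
                 ('regressor', LinearRegression())])
pipe.fit(example, y)
print(pipe.predict(example))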
