apply multilabel binarizer to multiple columns - python

I know how to apply multilabel binarizer on 1 column and it works for me.
For example, I do something like
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
binarized_df = pd.DataFrame(mlb.fit_transform(df['One']), columns=mlb.classes_, index=df.index)
However, I have 20 different columns to which I want to apply the binarizer, and applying it to all of them together does not work:
cols = ['One', 'Two',...'Twenty']
binarized_df = pd.DataFrame(mlb.fit_transform(df[cols]), columns=mlb.classes_, index=df.index)
Is there a way to use multilabel binarizer for multiple columns all together?

Since sklearn.preprocessing.MultiLabelBinarizer accepts an array-like value for classes, have you tried just adding pandas.DataFrame.values or pandas.DataFrame.to_numpy to your code?
classes : array-like of shape (n_classes,), default=None
cols = ['One', 'Two',...'Twenty']
binarized_df = pd.DataFrame(mlb.fit_transform(df[cols].values),
                            columns=mlb.classes_, index=df.index)
By the way, pandas.get_dummies might also work for you (it takes the names of the columns to encode rather than the classes, and it keeps the original index):
binarized_df = pd.get_dummies(df, columns=cols)
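If each of the 20 columns holds its own lists of labels, another option is to fit a separate MultiLabelBinarizer per column and concatenate the results. A minimal sketch, assuming df and cols as above and that each column contains list-like values:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
# One binarizer per column; the encoded blocks are concatenated side by side
encoded_parts = []
for col in cols:
    mlb = MultiLabelBinarizer()
    part = pd.DataFrame(mlb.fit_transform(df[col]),
                        columns=[f"{col}_{c}" for c in mlb.classes_],
                        index=df.index)
    encoded_parts.append(part)
binarized_df = pd.concat(encoded_parts, axis=1)
Prefixing each output column with the source column name avoids collisions when the same class appears in more than one column.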

Related

How to create synthetic data based on dataset with mixed data types for classification problem?

I am trying to build a classification model, but I don't have enough data. What would be the most appropriate way to create synthetic data based on my existing dataset if I have numerical and categorical features?
I looked at using Vine copulas like here: https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html#Vine-Copulas but sampling such copulas gives floats even for the columns that I would like to be integers (label-encoded categorical features). And then I don't know how to convert such floats back to categorical features.
Sample toy code of my problem is below
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_openml
from copulas.multivariate import VineCopula, GaussianMultivariate
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X['label'] = y
# reducing features and removing nulls to keep things simple
X = X[['sex', 'age', 'fare', 'embarked', 'label']]
row_keep = X.isnull().sum(axis=1) == 0
df = X.loc[row_keep, :].copy()
df.reset_index(drop=True, inplace=True)
# encoding columns
cat_cols = ['sex', 'embarked', 'label']
num_cols = ['age', 'fare']
label_encoders = {}
for c in cat_cols:
    cat_proc = preprocessing.LabelEncoder()
    col_proc = cat_proc.fit_transform(df[c])
    df[c] = col_proc
    label_encoders[c] = cat_proc
# Fit a copula
copula = VineCopula('regular')
copula.fit(df)
# Sample synthetic data
df_synthetic = copula.sample(1000)
All the columns of df_synthetic are floats. How would I convert those back to ints that I can map back to categorical features?
Is there another way to augment this sort of dataset? It would be even better if it were performant enough to sample 7000-10000 new synthetic entries. The toy problem with 5 columns above took ~1 min to sample 1000 rows, but my real problem has 27 columns, which I imagine would take a lot longer.
To have your columns converted to ints, use round and then .astype(int):
df_synthetic["sex"] = round(df_synthetic["sex"]).astype(int)
df_synthetic["embarked"] = round(df_synthetic["embarked"]).astype(int)
df_synthetic["label"] = round(df_synthetic["label"]).astype(int)
You might have to adjust values manually (e.g. clip sex to [0, 1] if a larger/smaller value has been generated), but that will strongly depend on your data characteristics.
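To go one step further and recover the original category labels, a sketch assuming the label_encoders dict and cat_cols from the question, with values clipped to the valid code range:
# Round, clip to the valid code range, then invert the label encoding
for c in cat_cols:
    n_classes = len(label_encoders[c].classes_)
    codes = df_synthetic[c].round().clip(0, n_classes - 1).astype(int)
    df_synthetic[c] = label_encoders[c].inverse_transform(codes)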

How to create a for loop for OneHotEncoder

I have a list of categorical columns within my dataframe that I am trying to OneHotEncode. I've used the following code for each of these columns individually, but cannot figure out how to iterate through my categoricals list to do the same. Does anyone know how to do this?
categoricals = ['bedrooms', 'bathrooms', 'floors', 'condition', 'grade',
                'yr_built']
from sklearn.preprocessing import OneHotEncoder
bedrooms = df[['bedrooms']]
bed = OneHotEncoder(categories="auto", sparse=False, handle_unknown="ignore")
bed.fit(bedrooms)
bed_encoded = bed.transform(bedrooms)
bed_encoded = pd.DataFrame(
    bed_encoded,
    columns=bed.categories_[0],
    index=df.index
)
df.drop("bedrooms", axis=1, inplace=True)
df = pd.concat([df, bed_encoded], axis=1)
Method 1:
You can use an ordinal encoder such as LabelEncoder first and then do the one-hot encoding, building the DataFrame from the result.
categorical_cols = ['bedrooms', 'bathrooms', 'floors', 'condition', 'grade',
                    'yr_built']
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()
# apply le on categorical feature columns
# data is the dataframe
data[categorical_cols] = data[categorical_cols].apply(lambda col: le.fit_transform(col))
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
# One-hot-encode the categorical columns.
# fit_transform returns a sparse matrix, so convert it to a dense array.
array_hot_encoded = ohe.fit_transform(data[categorical_cols]).toarray()
#Convert it to df
data_hot_encoded = pd.DataFrame(array_hot_encoded, index=data.index)
#Extract only the columns that are numeric and don't need to be encoded
data_numeric_cols = data.drop(columns=categorical_cols)
#Concatenate the two dataframes :
data_out = pd.concat([data_hot_encoded, data_numeric_cols], axis=1)
You could also use pd.factorize() to map the categorical data to ordinal data.
Method 2:
Use pd.get_dummies() to do the one-hot encoding directly from the raw data (no need to convert to ordinal data first).
import pandas as pd
df = pd.get_dummies(data, columns = categorical_cols)
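For completeness, the original pattern from the question can also be looped directly over the list. A sketch, assuming df and categoricals as defined in the question (sparse=False matches the question's scikit-learn version; newer releases use sparse_output=False):
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# One encoder per categorical column; the encoded columns replace the original one
for col in categoricals:
    enc = OneHotEncoder(categories="auto", sparse=False, handle_unknown="ignore")
    encoded = enc.fit_transform(df[[col]])
    encoded_df = pd.DataFrame(
        encoded,
        columns=[f"{col}_{c}" for c in enc.categories_[0]],
        index=df.index,
    )
    df = pd.concat([df.drop(columns=[col]), encoded_df], axis=1)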

Standardizing a set of columns in a pandas dataframe with sklearn

I have a table with four columns: CustomerID, Recency, Frequency and Revenue.
I need to standardize (scale) the columns Recency, Frequency and Revenue and save the column CustomerID.
I used this code:
from sklearn.preprocessing import normalize, StandardScaler
df.set_index('CustomerID', inplace = True)
standard_scaler = StandardScaler()
df = standard_scaler.fit_transform(df)
df = pd.DataFrame(data = df, columns = ['Recency', 'Frequency','Revenue'])
But the result is a table without the column CustomerID. Is there any way to get a table with the corresponding CustomerID and the scaled columns?
fit_transform returns an ndarray with no indices, so you are losing the index you set on df.set_index('CustomerID', inplace = True).
Instead of doing this, you can simply take the subset of columns you need to transform, pass them to StandardScaler, and overwrite the original columns.
# Subset of columns to transform
cols = ['Recency','Frequency','Revenue']
# Overwrite old columns with transformed columns
df[cols] = StandardScaler().fit_transform(df[cols])
This way, you leave CustomerID completely unchanged.
You can use scale to standardize specific columns:
from sklearn.preprocessing import scale
cols = ['Recency', 'Frequency', 'Revenue']
df[cols] = scale(df[cols])
You can also use this method:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# scale every column except the first (CustomerID)
df.iloc[:, 1:] = sc.fit_transform(df.iloc[:, 1:])
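If you prefer not to set an index at all, another option is a ColumnTransformer that scales the three columns and passes CustomerID through. A sketch, assuming df still contains all four original columns:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# Scale three columns, pass CustomerID through untouched (the remainder comes last in the output)
ct = ColumnTransformer(
    [('scale', StandardScaler(), ['Recency', 'Frequency', 'Revenue'])],
    remainder='passthrough')
df_scaled = pd.DataFrame(ct.fit_transform(df),
                         columns=['Recency', 'Frequency', 'Revenue', 'CustomerID'],
                         index=df.index)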

Factorize columns in two different DataFrames

I have two DataFrames and in each of them I have a categorical column col. I want to replace all the categories with numbers, so I decided to do it in this fashion:
df1['col'] = pd.factorize(df1['col'])[0]
Now the question is: how can I code df2[col] in the same way? And how can I also code categories that are present in df2[col] but not in df1[col]?
You need a LabelEncoder:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
df1['col'] = enc.fit_transform(df1['col'])
df2['col'] = enc.transform(df2['col'])
For unseen labels, this may be a solution:
enc = LabelEncoder()
enc.fit(df1['col'])
diz_map = dict(zip(enc.classes_, enc.transform(enc.classes_) + 1))
for i in set(df2['col']).difference(df1['col']):
    diz_map[i] = 0
df1['col'] = [diz_map[i] for i in df1['col'].values]
df2['col'] = [diz_map[i] for i in df2['col'].values]
This maps all the unseen values in df2['col'] to 0.
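A pandas-only alternative is to reuse the categories learned from df1 when encoding df2; values absent from df1 then get pandas' default missing code of -1. A sketch, assuming both DataFrames have the column col:
import pandas as pd
# Categories are taken from df1; unseen values in df2 are coded as -1
categories = pd.Categorical(df1['col']).categories
df1['col'] = pd.Categorical(df1['col'], categories=categories).codes
df2['col'] = pd.Categorical(df2['col'], categories=categories).codes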

How to transform some columns only with SimpleImputer or equivalent

I am taking my first steps with the scikit-learn library and found myself in need of backfilling only some columns in my data frame.
I have read carefully the documentation but I still cannot figure out how to achieve this.
To make this more specific, let's say I have:
A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
And that I would like to fill in the second column with the mean but not the third. How can I do this with SimpleImputer (or another helper class)?
An evolution from this, and the natural follow-up question, is: how can I fill the second column with the mean and the last column with a constant (only for cells that had no values to begin with, obviously)?
There is no need to use the SimpleImputer.
DataFrame.fillna() can do the work as well
For the second column, use
column.fillna(column.mean(), inplace=True)
For the third column, use
column.fillna(constant, inplace=True)
Of course, you will need to replace column with the column of your DataFrame you want to change, and constant with your desired constant.
Edit
Since the use of inplace is discouraged and will be deprecated, the syntax should be
column = column.fillna(column.mean())
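A concrete sketch on the array from the question, assuming it is wrapped in a DataFrame and using 29 as a hypothetical constant:
import numpy as np
import pandas as pd
A = [[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]]
df = pd.DataFrame(A)
# Fill the second column with its mean and the third with the constant
df[1] = df[1].fillna(df[1].mean())
df[2] = df[2].fillna(29)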
Following Dan's advice, an example of using ColumnTransformer and SimpleImputer to backfill the columns is:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
column_trans = ColumnTransformer(
    [('imp_col1', SimpleImputer(strategy='mean'), [1]),
     ('imp_col2', SimpleImputer(strategy='constant', fill_value=29), [2])],
    remainder='passthrough')
print(column_trans.fit_transform(A)[:, [2,0,1]])
# [[7 2.0 3]
# [4 3.5 6]
# [10 5.0 29]]
This approach helps with constructing pipelines which are more suitable for larger applications.
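For example, a sketch of dropping the same column_trans into a Pipeline (the downstream LinearRegression estimator is only a hypothetical placeholder):
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
# Imputation step feeding a downstream estimator; fit with pipe.fit(X, y)
pipe = Pipeline([
    ('impute', column_trans),
    ('model', LinearRegression()),
])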
This is the method I use. You can replace low_cardinality_cols with the columns you want to encode; it also works on all columns if you simply set the unique-value threshold to the maximum cardinality, max(df.nunique()).
# Check the cardinality of the columns to encode
low_cardinality_cols = [cname for cname in df.columns if df[cname].nunique() < 16 and
                        df[cname].dtype == "object"]
Why these columns? It is recommended to encode only columns with a cardinality of around 10.
# Replace NaN first, otherwise you will get stuck
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')  # feel free to use another strategy
df[low_cardinality_cols] = imp.fit_transform(df[low_cardinality_cols])
# Apply a label encoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in low_cardinality_cols:
    df[col] = label_encoder.fit_transform(df[col])
I am assuming you have your data as a pandas dataframe.
In this case, all you need to do to use the SimpleImputer from scikit-learn is to pick the specific column you're looking to impute NaNs in, say using the 'most_frequent' strategy, convert it to a NumPy array, and reshape it into a column vector.
An example of this is:
## Imputing the missing values, we fill the missing values using the 'most_frequent'
# We are using the california housing dataset in this example
housing = pd.read_csv('housing.csv')
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
#Simple imputer expects a column vector, so converting the pandas Series
housing['total_bedrooms'] = imp.fit_transform(housing['total_bedrooms'].to_numpy().reshape(-1,1))
Similarly, you can pick any column in your dataset, convert it into a NumPy array, reshape it, and use the SimpleImputer.
