I have two DataFrames, and in each of them I have a categorical column col. I want to replace all the categories with numbers, so I decided to do it this way:
df1['col'] = pd.factorize(df1['col'])[0]
Now the question is: how can I encode df2['col'] in the same way? And how can I also encode categories that are present in df2['col'] but not in df1['col']?
You can use a LabelEncoder:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
df1['col'] = enc.fit_transform(df1['col'])
df2['col'] = enc.transform(df2['col'])  # raises ValueError on labels unseen in df1
For unseen labels, this may be a solution:
enc = LabelEncoder()
enc.fit(df1['col'])
diz_map = dict(zip(enc.classes_, enc.transform(enc.classes_) + 1))
for i in set(df2['col']).difference(df1['col']):
    diz_map[i] = 0
df1['col'] = [diz_map[i] for i in df1['col'].values]
df2['col'] = [diz_map[i] for i in df2['col'].values]
This maps all the values in df2['col'] that were not seen in df1['col'] to 0.
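The same idea can be sketched without sklearn, using pd.Categorical with the category order learned from df1. The toy df1/df2 below are illustrative; pd.Categorical assigns code -1 to values outside the known categories, so shifting by 1 reserves 0 for unseen labels:

```python
import pandas as pd

# Toy data: 'd' appears only in df2 (unseen at fit time)
df1 = pd.DataFrame({'col': ['a', 'b', 'a', 'c']})
df2 = pd.DataFrame({'col': ['b', 'd', 'a']})

# Learn the category order from df1
cats = pd.Categorical(df1['col']).categories

# Codes start at 0; shift by 1 so that 0 is free for unseen labels
# (pd.Categorical maps values outside `cats` to -1, and -1 + 1 = 0).
df1['col'] = pd.Categorical(df1['col'], categories=cats).codes + 1
df2['col'] = pd.Categorical(df2['col'], categories=cats).codes + 1
```

With this toy data, df1['col'] becomes [1, 2, 1, 3] and the unseen 'd' in df2 becomes 0.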
I know how to apply a MultiLabelBinarizer to one column, and it works for me.
For example, I do something like
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
binarized_df = pd.DataFrame(mlb.fit_transform(df['One']), columns=mlb.classes_, index=df.index)
However, I have 20 different columns to which I want to apply the binarizer, and if I try to apply it to all of them together, it does not work:
cols = ['One', 'Two',...'Twenty']
binarized_df = pd.DataFrame(mlb.fit_transform(df[cols]), columns=mlb.classes_, index=df.index)
Is there a way to use a MultiLabelBinarizer for multiple columns all together?
Since sklearn.preprocessing.MultiLabelBinarizer accepts an array-like value for classes, have you tried adding pandas.DataFrame.values or pandas.DataFrame.to_numpy() to your code?
classes : array-like of shape (n_classes,), default=None
cols = ['One', 'Two',...'Twenty']
binarized_df = pd.DataFrame(mlb.fit_transform(df[cols].values),
columns=mlb.classes_, index=df.index)
By the way, pandas.get_dummies might also work for you:
binarized_df = pd.get_dummies(df, columns=cols)
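If each of the 20 columns holds lists of labels (the case MultiLabelBinarizer is designed for) and you want a separate binarizer per column, one way is to loop over the columns and concatenate the pieces. This is a sketch with made-up two-column data; the column names are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data: each cell holds a list of labels
df = pd.DataFrame({
    'One': [['a', 'b'], ['b'], ['c']],
    'Two': [['x'], ['x', 'y'], ['y']],
})
cols = ['One', 'Two']

encoded_parts = []
for col in cols:
    mlb = MultiLabelBinarizer()
    part = pd.DataFrame(
        mlb.fit_transform(df[col]),
        # Prefix with the column name so classes from different
        # columns cannot collide
        columns=[f'{col}_{c}' for c in mlb.classes_],
        index=df.index,
    )
    encoded_parts.append(part)

binarized_df = pd.concat(encoded_parts, axis=1)
```

Prefixing the output columns with the source column name keeps the 20 binarizers' classes from clashing with each other.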
I have a list of categorical columns within my dataframe that I am trying to OneHotEncode. I've used the following code for each of these columns individually, but cannot figure out how to iterate through my categoricals list to do the same. Does anyone know how to do this?
categoricals = ['bedrooms', 'bathrooms', 'floors', 'condition', 'grade',
'yr_built']
from sklearn.preprocessing import OneHotEncoder
bedrooms = df[['bedrooms']]
bed = OneHotEncoder(categories="auto", sparse=False, handle_unknown="ignore")
bed.fit(bedrooms)
bed_encoded = bed.transform(bedrooms)
bed_encoded = pd.DataFrame(
bed_encoded,
columns=bed.categories_[0],
index=df.index
)
df.drop("bedrooms", axis=1, inplace=True)
df = pd.concat([df, bed_encoded], axis=1)
Method 1
You can use an ordinal encoder like LabelEncoder first and then do the one-hot encoding, rebuilding a DataFrame from the result.
categorical_cols = ['bedrooms', 'bathrooms', 'floors', 'condition', 'grade',
'yr_built']
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()
# apply le on categorical feature columns
# data is the dataframe
data[categorical_cols] = data[categorical_cols].apply(lambda col: le.fit_transform(col))
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
# One-hot-encode the categorical columns.
# fit_transform outputs a sparse matrix, so densify it
# before building a DataFrame.
array_hot_encoded = ohe.fit_transform(data[categorical_cols]).toarray()
# Convert it to a df
data_hot_encoded = pd.DataFrame(array_hot_encoded, index=data.index)
#Extract only the columns that are numeric and don't need to be encoded
data_numeric_cols = data.drop(columns=categorical_cols)
#Concatenate the two dataframes :
data_out = pd.concat([data_hot_encoded, data_numeric_cols], axis=1)
You could also use pd.factorize() to map the categorical data to ordinal data.
Method 2
Use pd.get_dummies() to do the one-hot encoding directly from the raw data (you don't have to convert to ordinal data first):
import pandas as pd
df = pd.get_dummies(data, columns = categorical_cols)
I have a table with four columns: CustomerID, Recency, Frequency and Revenue.
I need to standardize (scale) the columns Recency, Frequency and Revenue and save the column CustomerID.
I used this code:
from sklearn.preprocessing import normalize, StandardScaler
df.set_index('CustomerID', inplace = True)
standard_scaler = StandardScaler()
df = standard_scaler.fit_transform(df)
df = pd.DataFrame(data = df, columns = ['Recency', 'Frequency','Revenue'])
But the result is a table without the column CustomerID. Is there any way to get a table with the corresponding CustomerID and the scaled columns?
fit_transform returns an ndarray with no indices, so you are losing the index you set with df.set_index('CustomerID', inplace=True).
Instead of doing this, you can simply take the subset of columns you need to transform, pass them to StandardScaler, and overwrite the original columns.
# Subset of columns to transform
cols = ['Recency','Frequency','Revenue']
# Overwrite old columns with transformed columns
df[cols] = StandardScaler().fit_transform(df[cols])
This way, you leave CustomerID completely unchanged.
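Put together on toy data (the values below are made up for illustration), the approach looks like this:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy RFM table
df = pd.DataFrame({
    'CustomerID': [101, 102, 103],
    'Recency': [10.0, 20.0, 30.0],
    'Frequency': [1.0, 2.0, 3.0],
    'Revenue': [100.0, 200.0, 300.0],
})

cols = ['Recency', 'Frequency', 'Revenue']
# Scale only the selected columns; CustomerID is untouched
df[cols] = StandardScaler().fit_transform(df[cols])
```

After scaling, each of the three columns has mean 0 and unit variance, while CustomerID still holds the original IDs.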
You can use scale to standardize specific columns:
from sklearn.preprocessing import scale
cols = ['Recency', 'Frequency', 'Revenue']
df[cols] = scale(df[cols])
You can use this method:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
df.iloc[:, 1:] = sc.fit_transform(df.iloc[:, 1:])
I am using the code below in Python to one-hot encode one of many categorical variables in my dataset. However, I want to encode multiple columns in one go but am unable to do so. Also, these columns have different numbers of categories, e.g. one might have just Yes and No, but other columns have 4-5 different categories. How can I encode them all together using the code below and append the result to the main dataset?
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define example
data = df[["column-name"]]
values = array(data)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
You can easily achieve what you want using pandas' get_dummies() function. Try executing this code:
data = pd.get_dummies(data)
This will encode all the categorical variables and append them to the main dataframe.
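To encode only some columns and leave the numeric ones alone, get_dummies accepts a columns argument. A sketch with made-up column names, covering the mixed case from the question (a Yes/No column next to one with more categories):

```python
import pandas as pd

# Toy data: 'answer' has 2 categories, 'grade' has 3, 'score' is numeric
df = pd.DataFrame({
    'answer': ['Yes', 'No', 'Yes'],
    'grade': ['A', 'B', 'C'],
    'score': [1, 2, 3],
})

# Only the listed columns are encoded; 'score' passes through untouched
encoded = pd.get_dummies(df, columns=['answer', 'grade'])
```

Each categorical column contributes as many dummy columns as it has categories, so columns with different numbers of categories are handled in one call.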
I'm working on a project applying the One Hot Encoding technique to a categorical column of a .binetflow file.
CODE:
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
mydataset = pd.read_csv('originalfiletest.binetflow')
le = LabelEncoder()
dfle = mydataset
dfle.State = le.fit_transform(dfle.State)
X = dfle[['State']].values
ohe = OneHotEncoder()
Onehot = ohe.fit_transform(X).toarray()
dfle['State'] = Onehot
mydataset.to_csv('newfiletest.binetflow', columns=['Dur','State','TotBytes','average_packet_size','average_bits_psecond'], index=False)
Original binetflow file
Currently I'm using pandas, and I'm able to apply the technique. The problem is when I need to write to the second file.
When I try to write, the output I'm expecting is, for example, 0001 or 0.0.0.1 in the variable Onehot, but what I get is either 0.0 or 1.0 when I pass it to the column dfle['State'].
The images can be found below.
variable Onehot
column dfle['State']
Moreover, for the column that should just be written as-is, the print output in the compiler looks correct, but when it is written to the file a few decimal places are added.
Original and new binetflow file
Onehot is a numpy array, and the problem is in your assignment of the array to the dataframe column:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
mydataset = pd.DataFrame(data={'State': ['a', 'a', 'b', 'c', 'a', 'd']})
le = LabelEncoder()
mydataset.State = le.fit_transform(mydataset.State)
X = mydataset[['State']].values
ohe = OneHotEncoder()
Onehot = ohe.fit_transform(X).toarray()
dx = pd.DataFrame(data=Onehot)
mydataset['State'] = (dx[dx.columns[0:]].apply(lambda x: ','.join(x.dropna().astype(int).astype(str)), axis=1))
mydataset.to_csv('newfiletest.binetflow',
columns=['Dur', 'State', 'TotBytes', 'average_packet_size', 'average_bits_psecond'], index=False)
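Condensing the answer's approach and running it end-to-end on the toy State column shows the joined strings coming out in the 0/1 pattern the question asked for (a sketch; only the State handling is reproduced):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

mydataset = pd.DataFrame(data={'State': ['a', 'a', 'b', 'c', 'a', 'd']})

# Integer-encode, then one-hot-encode the single State column
le = LabelEncoder()
mydataset['State'] = le.fit_transform(mydataset['State'])
X = mydataset[['State']].values
ohe = OneHotEncoder()
Onehot = ohe.fit_transform(X).toarray()

# Join each one-hot row into a single comma-separated string,
# so the whole vector fits into one CSV cell
dx = pd.DataFrame(data=Onehot)
mydataset['State'] = dx.apply(
    lambda row: ','.join(row.astype(int).astype(str)), axis=1)
```

For example, every 'a' row becomes the string '1,0,0,0' and the single 'd' row becomes '0,0,0,1', which survives a round-trip through to_csv without being truncated to a single float.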