How to create a for loop for OneHotEncoder - python

I have a list of categorical columns within my dataframe that I am trying to OneHotEncode. I've used the following code for each of these columns individually, but cannot figure out how to iterate through my categoricals list to do the same. Does anyone know how to do this?
categoricals = ['bedrooms', 'bathrooms', 'floors', 'condition', 'grade',
                'yr_built']

from sklearn.preprocessing import OneHotEncoder

bedrooms = df[['bedrooms']]
bed = OneHotEncoder(categories="auto", sparse=False, handle_unknown="ignore")
bed.fit(bedrooms)
bed_encoded = bed.transform(bedrooms)
bed_encoded = pd.DataFrame(
    bed_encoded,
    columns=bed.categories_[0],
    index=df.index
)
df.drop("bedrooms", axis=1, inplace=True)
df = pd.concat([df, bed_encoded], axis=1)

Method 1:
Create the DataFrame first. You can apply an ordinal encoder such as LabelEncoder to each column and then one-hot encode the result. (Recent scikit-learn versions of OneHotEncoder handle string categories directly, so the ordinal step is optional there.)
categorical_cols = ['bedrooms', 'bathrooms', 'floors', 'condition', 'grade',
                    'yr_built']

from sklearn.preprocessing import LabelEncoder

# instantiate labelencoder object
le = LabelEncoder()
# apply le on categorical feature columns
# (data is the dataframe)
data[categorical_cols] = data[categorical_cols].apply(
    lambda col: le.fit_transform(col))
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)  # renamed to sparse_output in sklearn >= 1.2

# One-hot encode the categorical columns.
# fit_transform outputs a dense array instead of a dataframe
# (without sparse=False it would return a sparse matrix,
# which pd.DataFrame cannot consume directly).
array_hot_encoded = ohe.fit_transform(data[categorical_cols])

# Convert it to a dataframe, keeping the original index
data_hot_encoded = pd.DataFrame(array_hot_encoded, index=data.index)
# Extract only the columns that are numeric and don't need to be encoded
data_numeric_cols = data.drop(columns=categorical_cols)

# Concatenate the two dataframes
data_out = pd.concat([data_hot_encoded, data_numeric_cols], axis=1)
You could also use pd.factorize() to map the categorical data to ordinal data.
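For example, a minimal factorize-based sketch of that ordinal step (assuming the same data and categorical_cols as above):
# map each categorical column to integer codes with pd.factorize
for col in categorical_cols:
    data[col] = pd.factorize(data[col])[0]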
Method 2:
Use pd.get_dummies(), which one-hot encodes directly from the raw data (no need to convert to ordinal data first):
import pandas as pd
df = pd.get_dummies(data, columns = categorical_cols)
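To answer the iteration part of the question directly, here is a minimal loop sketch that applies the per-column approach from the question to every entry in categoricals (assuming the same df, and a scikit-learn version that accepts sparse=False):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

for col in categoricals:
    enc = OneHotEncoder(categories="auto", sparse=False,
                        handle_unknown="ignore")
    encoded = enc.fit_transform(df[[col]])
    # prefix the new columns so levels from different features don't collide
    encoded_df = pd.DataFrame(
        encoded,
        columns=[f"{col}_{c}" for c in enc.categories_[0]],
        index=df.index,
    )
    df = pd.concat([df.drop(col, axis=1), encoded_df], axis=1)
On recent scikit-learn versions, ColumnTransformer with a single OneHotEncoder and get_feature_names_out() achieves the same result without the manual loop.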

Related

Standardizing a set of columns in a pandas dataframe with sklearn

I have a table with four columns: CustomerID, Recency, Frequency and Revenue.
I need to standardize (scale) the columns Recency, Frequency and Revenue and save the column CustomerID.
I used this code:
from sklearn.preprocessing import normalize, StandardScaler
df.set_index('CustomerID', inplace = True)
standard_scaler = StandardScaler()
df = standard_scaler.fit_transform(df)
df = pd.DataFrame(data = df, columns = ['Recency', 'Frequency','Revenue'])
But the result is a table without the column CustomerID. Is there any way to get a table with the corresponding CustomerID and the scaled columns?
fit_transform returns an ndarray with no indices, so you are losing the index you set on df.set_index('CustomerID', inplace = True).
Instead of doing this, you can simply take the subset of columns you need to transform, pass them to StandardScaler, and overwrite the original columns.
# Subset of columns to transform
cols = ['Recency', 'Frequency', 'Revenue']
# Overwrite old columns with transformed columns
df[cols] = StandardScaler().fit_transform(df[cols])
This way, you leave CustomerID completely unchanged.
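A self-contained sketch of that approach (the data here is invented for illustration):
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Recency': [10, 5, 30],
    'Frequency': [2, 8, 1],
    'Revenue': [100.0, 450.0, 35.0],
}).set_index('CustomerID')

cols = ['Recency', 'Frequency', 'Revenue']
df[cols] = StandardScaler().fit_transform(df[cols])
print(df)  # CustomerID survives as the index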
You can use scale to standardize specific columns:
from sklearn.preprocessing import scale
cols = ['Recency', 'Frequency', 'Revenue']
df[cols] = scale(df[cols])
You can use this method (note that plain df[:, 1:] is not valid DataFrame indexing; use .iloc for positional slicing):
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
df.iloc[:, 1:] = sc.fit_transform(df.iloc[:, 1:])

Factorize columns in two different DataFrames

I have two DataFrames and in each of them I have a categorical column col. I want to replace all the categories with numbers, so I decided to do it this fashion:
df1['col'] = pd.factorize(df1['col'])[0]
Now the question is: how can I encode df2['col'] in the same way? And how can I also handle categories that are present in df2['col'] but not in df1['col']?
You need a LabelEncoder
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
df1['col'] = enc.fit_transform(df1['col'])
df2['col'] = enc.transform(df2['col'])
Since enc.transform raises a ValueError on labels it hasn't seen, this may be a solution for unseen labels:
enc = LabelEncoder()
enc.fit(df1['col'])
diz_map = dict(zip(enc.classes_, enc.transform(enc.classes_) + 1))
for i in set(df2['col']).difference(df1['col']):
    diz_map[i] = 0
df1['col'] = [diz_map[i] for i in df1['col'].values]
df2['col'] = [diz_map[i] for i in df2['col'].values]
This maps every value in df2['col'] that was not seen in df1['col'] to 0.
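A pandas-only sketch of the same idea, using pd.Categorical with the categories learned from df1 (the +1 shift reproduces the convention above, where unseen values get code 0):
import pandas as pd

cats = pd.Categorical(df1['col']).categories
# .codes is -1 for values not in cats, so adding 1 maps unseen values to 0
df1['col'] = pd.Categorical(df1['col'], categories=cats).codes + 1
df2['col'] = pd.Categorical(df2['col'], categories=cats).codes + 1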

CSV normalization in Python

I'm working on a CSV file that contains several medical data fields, and I want to use it in an ML model. Before running the model, I want to normalize the data between 0 and 1. Below is my script, but it produces an error. How do I resolve it?
Sample input file
import pandas as pd
import scipy as sp
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from pandas import read_csv
Data = ('Medical_Data.csv')
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(Data, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
Error message:
could not convert string to float: 'preg'
You're performing pd.read_csv twice: Data is already a DataFrame, and you cannot call pd.read_csv on a DataFrame.
---- UPDATE
names needs to be defined before the read_csv call. Please refer to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html .
import pandas as pd
import scipy as sp
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from pandas import read_csv
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv('Medical_Data.csv', names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
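One more thing worth checking (an assumption, since the file contents aren't shown): if Medical_Data.csv already contains a header row, passing names= makes pandas treat that header as a data row, which is exactly what puts the string 'preg' in front of MinMaxScaler. Passing header=0 together with names avoids this:
# skip the existing header row while still applying our own column names
dataframe = read_csv('Medical_Data.csv', names=names, header=0)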
You don't have to use pd.read_csv twice. Just use it like this:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
Data = pd.read_csv('Medical_Data.csv',names=names)
And if you also want to get the DataFrame's columns, use this code:
columns = Data.columns
Data.columns returns the list of column names.

Question regarding OneHotEncoding - Python

I'm working on a project applying the One Hot Encoding technique to a categorical column of a .binetflow file.
CODE:
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
mydataset = pd.read_csv('originalfiletest.binetflow')
le = LabelEncoder()
dfle = mydataset
dfle.State = le.fit_transform(dfle.State)
X = dfle[['State']].values
ohe = OneHotEncoder()
Onehot = ohe.fit_transform(X).toarray()
dfle['State'] = Onehot
mydataset.to_csv('newfiletest.binetflow', columns=['Dur','State','TotBytes','average_packet_size','average_bits_psecond'], index=False)
Original binetflow file
Currently I'm using pandas and I'm able to apply the technique. The problem is when I need to write to the second file.
When I write, the output I'm expecting in the variable Onehot is, for example, 0001 or 0.0.0.1, but what I get is either 0.0 or 1.0 when I pass it to the column dfle['State'].
The images can be found below:
variable Onehot
column dfle['State']
Moreover, for a column that should be written as-is, the print output looks correct, but the file ends up with a few extra decimal places.
Original and new binetflow file
Onehot is a NumPy array, and the problem is in the assignment of that 2-D array to a single dataframe column:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

mydataset = pd.DataFrame(data={'State': ['a', 'a', 'b', 'c', 'a', 'd']})
le = LabelEncoder()
mydataset.State = le.fit_transform(mydataset.State)
X = mydataset[['State']].values
ohe = OneHotEncoder()
Onehot = ohe.fit_transform(X).toarray()
dx = pd.DataFrame(data=Onehot)
# join each one-hot row into a single comma-separated string
mydataset['State'] = dx.apply(
    lambda x: ','.join(x.dropna().astype(int).astype(str)), axis=1)
# write out (with the OP's real dataset, which has these columns;
# the toy frame above only has 'State')
mydataset.to_csv('newfiletest.binetflow',
                 columns=['Dur', 'State', 'TotBytes', 'average_packet_size',
                          'average_bits_psecond'],
                 index=False)
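As a side note, if separate indicator columns in the output file are acceptable instead of one joined string, here is a hedged alternative sketch using pd.get_dummies (applied to the original categorical column, before any label encoding):
# one indicator column per State level, e.g. State_a, State_b, ...
onehot_df = pd.get_dummies(mydataset['State'], prefix='State')
out = pd.concat([mydataset.drop(columns='State'), onehot_df], axis=1)
out.to_csv('newfiletest.binetflow', index=False)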

How to use sklearn fit_transform with pandas and return dataframe instead of numpy array?

I want to apply scaling (using StandardScaler() from sklearn.preprocessing) to a pandas dataframe. The following code returns a numpy array, so I lose all the column names and indices. This is not what I want.
features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features = autoscaler.fit_transform(features)
A "solution" I found online is:
features = features.apply(lambda x: autoscaler.fit_transform(x))
It appears to work, but leads to a DeprecationWarning:
/usr/lib/python3.5/site-packages/sklearn/preprocessing/data.py:583:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
I therefore tried:
features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
But this gives:
Traceback (most recent call last):
  File "./analyse.py", line 91, in <module>
    features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 3972, in apply
    return self._apply_standard(f, axis, reduce=reduce)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 4081, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 226, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 363, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5163, in _arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5477, in _homogenize
    raise_cast_failure=False)
  File "/usr/lib/python3.5/site-packages/pandas/core/series.py", line 2885, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
How do I apply scaling to the pandas dataframe, leaving the dataframe intact? Without copying the data if possible.
You could convert the DataFrame to a numpy array using as_matrix(). Example on a random dataset:
Edit:
Changed as_matrix() to values (it doesn't change the result), per the last sentence of the as_matrix() docs; as_matrix() has since been removed from pandas entirely:
Generally, it is recommended to use '.values'.
import pandas as pd
import numpy as np  # for the random integer example

df = pd.DataFrame(np.random.randint(0.0, 100.0, size=(10, 4)),
                  index=range(10, 20),
                  columns=['col1', 'col2', 'col3', 'col4'],
                  dtype='float64')
Note, indices are 10-19:
In [14]: df.head(3)
Out[14]:
col1 col2 col3 col4
10 3 38 86 65
11 98 3 66 68
12 88 46 35 68
Now fit_transform the DataFrame to get the scaled_features array:
from sklearn.preprocessing import StandardScaler
scaled_features = StandardScaler().fit_transform(df.values)
In [15]: scaled_features[:3,:] #lost the indices
Out[15]:
array([[-1.89007341,  0.05636005,  1.74514417,  0.46669562],
       [ 1.26558518, -1.35264122,  0.82178747,  0.59282958],
       [ 0.93341059,  0.37841748, -0.60941542,  0.59282958]])
Assign the scaled data to a DataFrame (note: use the index and columns keyword arguments to keep your original indices and column names):
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
In [17]: scaled_features_df.head(3)
Out[17]:
col1 col2 col3 col4
10 -1.890073 0.056360 1.745144 0.466696
11 1.265585 -1.352641 0.821787 0.592830
12 0.933411 0.378417 -0.609415 0.592830
Edit 2:
Came across the sklearn-pandas package. It's focused on making scikit-learn easier to use with pandas. sklearn-pandas is especially useful when you need to apply more than one type of transformation to column subsets of the DataFrame, a more common scenario. It's well documented, and this is how you'd achieve the transformation we just performed:
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([(df.columns, StandardScaler())])
scaled_features = mapper.fit_transform(df.copy(), 4)
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('your file here')
ss = StandardScaler()
df_scaled = pd.DataFrame(ss.fit_transform(df),columns = df.columns)
df_scaled will be the 'same' dataframe, only now with the scaled values.
Reassigning back through df.values preserves both index and columns. Note this relies on .values being a view into the frame, which generally only holds when all columns share a single numeric dtype.
df.values[:] = StandardScaler().fit_transform(df)
features = ["col1", "col2", "col3", "col4"]
autoscaler = StandardScaler()
df[features] = autoscaler.fit_transform(df[features])
This worked with MinMaxScaler for getting the array values back into the original dataframe. It should work with StandardScaler as well.
data_scaled = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
where data_scaled is the new data frame, scaled_features is the array after normalization, and df is the original dataframe whose index and columns we need back.
Works for me:
from sklearn.preprocessing import StandardScaler
cols = list(train_df_x_num.columns)
scaler = StandardScaler()
train_df_x_num[cols] = scaler.fit_transform(train_df_x_num[cols])
This is what I did:
X.Column1 = StandardScaler().fit_transform(X.Column1.values.reshape(-1, 1))
Since scikit-learn version 1.2, estimators can return a DataFrame, keeping the column names.
set_output can be configured per estimator by calling the set_output method, or globally by setting set_config(transform_output="pandas").
See Release Highlights for scikit-learn 1.2 - Pandas output with set_output API.
Example for set_output():
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().set_output(transform="pandas")
Example for set_config():
from sklearn import set_config
set_config(transform_output="pandas")
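A minimal usage sketch (assuming scikit-learn >= 1.2; the small numeric frame is invented for illustration):
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [10.0, 20.0, 30.0]})
scaler = StandardScaler().set_output(transform="pandas")
scaled = scaler.fit_transform(df)  # a DataFrame, not an ndarray
print(type(scaled), list(scaled.columns))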
You can mix multiple data types in scikit-learn using Neuraxle:
Option 1: discard the row names and column names
from neuraxle.pipeline import Pipeline
from neuraxle.base import NonFittableMixin, BaseStep

class PandasToNumpy(NonFittableMixin, BaseStep):
    def transform(self, data_inputs, expected_outputs):
        return data_inputs.values

pipeline = Pipeline([
    PandasToNumpy(),
    StandardScaler(),
])
Then, you proceed as you intended:
features = df[["col1", "col2", "col3", "col4"]] # ... your df data
pipeline, scaled_features = pipeline.fit_transform(features)
Option 2: to keep the original column names and row names
You could even do this with a wrapper as such:
from neuraxle.pipeline import Pipeline
from neuraxle.base import MetaStepMixin, BaseStep

class PandasValuesChangerOf(MetaStepMixin, BaseStep):
    def transform(self, data_inputs, expected_outputs):
        new_data_inputs = self.wrapped.transform(data_inputs.values)
        new_data_inputs = self._merge(data_inputs, new_data_inputs)
        return new_data_inputs

    def fit_transform(self, data_inputs, expected_outputs):
        self.wrapped, new_data_inputs = self.wrapped.fit_transform(data_inputs.values)
        new_data_inputs = self._merge(data_inputs, new_data_inputs)
        return self, new_data_inputs

    def _merge(self, data_inputs, new_data_inputs):
        new_data_inputs = pd.DataFrame(
            new_data_inputs,
            index=data_inputs.index,
            columns=data_inputs.columns
        )
        return new_data_inputs
df_scaler = PandasValuesChangerOf(StandardScaler())
Then, you proceed as you intended:
features = df[["col1", "col2", "col3", "col4"]] # ... your df data
df_scaler, scaled_features = df_scaler.fit_transform(features)
You can try this code; it will give you a DataFrame with indices:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston  # Boston housing dataset
# (note: load_boston was removed in scikit-learn 1.2)

boston = load_boston()
dt = boston.data
col = boston.feature_names

# Make a dataframe
df = pd.DataFrame(data=dt, columns=col)

# define a method to scale data, looping through the columns, and passing a scaler
def scale_data(data, columns, scaler):
    for col in columns:
        data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))
    return data

# specify a scaler, and call the method on the Boston data
scaler = StandardScaler()
df_scaled = scale_data(df, col, scaler)

# view the first 10 rows of the scaled dataframe
df_scaled[0:10]
You could directly assign a numpy array to a dataframe selection by using slicing:
from sklearn.preprocessing import StandardScaler

features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
# in-place assignment keeps the index and column names
# (may raise SettingWithCopyWarning, since features is a slice of df)
features[:] = autoscaler.fit_transform(features.values)
