How to classify and substitute NaN with a pipeline - python

I have the following data frame:
import pandas as pd
d = {'hrs': [1, "NaN", 2], 'Department': ["ResearchDevelopment", "NaN", "ResearchDevelopment"]}
df = pd.DataFrame(data=d)
df
And I want to use a pipeline in order to classify it as categorical or numerical and predict NaN values as the median in the case of numerical and ignore in the case of categorical.
I have done the following:
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import pipeline
numeric_features = ["hrs"]
numeric_transformer = Pipeline(
steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_features = ["Department"]
categorical_transformer = Pipeline(
steps=[("imputer", SimpleImputer(strategy="most_frequent")),("OneHotEncoder", OneHotEncoder(handle_unknown="ignore"))]
)
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("cat", categorical_transformer, categorical_features),
]
)
But I do not know where is the error, and if I have to call the data frame or not.
No error happens, but the pipeline is not changing anything.

Related

Specifying columns in scikit-learn pipeline after ColumnTransformer

I want to construct a scikit-learn pipeline in which some columns have values imputed, and then scaling is subsequently applied to some. If I put both operations in the same columntransformer this does not work as they proceed in parallel (and so missing values cause the scaler to fail). If I make two columntransformers and run them in series, however, I run into the issue that I cannot specify column names (as output of the first transformer is a np array). What is the correct way to go about this?
numeric_columns = list(X.select_dtypes('float64').columns)
cat_columns = list(X.select_dtypes('object').columns)+list(X.select_dtypes('int64').columns)
# Imputation
imp_mean = SimpleImputer(strategy='mean')
imp_freq = SimpleImputer(strategy='most_frequent')
imputer = ColumnTransformer(
[('Imput_mean', imp_mean, numeric_columns),
('Imput_freq', imp_freq, cat_columns),
], remainder='passthrough'
)
# Scaling
feature_transformer = ColumnTransformer(
[('num',StandardScaler(),numeric_columns),
], remainder='passthrough'
)
#Hyperparameters
parameters = {'model__n_components':[1,2,3,4,5]}
#Pipeline
pipeline = Pipeline([('imputer', imputer),
('feature_transformer', feature_transformer),
('model', PLSRegression())])
#Cross validation strategy
cv = KFold(n_splits=10, shuffle=True)
#Cross valdiate and evaluate
clf = GridSearchCV(pipeline, parameters, scoring="r2", cv=10)
cross_val_score(clf, X, y, cv=cv, scoring="r2"))
You might nest a Pipeline which takes care of the preprocessing of the numerical columns (performed serially) within the ColumnTransformer instance.
Here's an example:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X = pd.DataFrame({'city': ['London', 'London', '', 'Sallisaw'],
'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath'],
'expert_rating': [5, 3, np.nan, 5],
'user_rating': [4, np.nan, 4, 3]})
numeric_columns = list(X.select_dtypes('float64').columns)
cat_columns = list(X.select_dtypes('object').columns) + list(X.select_dtypes('int64').columns)
imp_mean = SimpleImputer(strategy='mean')
imp_freq = SimpleImputer(missing_values='', strategy='most_frequent')
ct = ColumnTransformer([
('Imput_freq', imp_freq, cat_columns),
('pipe_num', Pipeline([('Imput_mean', imp_mean), ('num', StandardScaler())]), numeric_columns)
], remainder='passthrough'
)
pd.DataFrame(ct.fit_transform(X))
Here's a similar post: How to execute both parallel and serial transformations with sklearn pipeline?.

Sklearn pipeline transform specific columns - ValueError: too many values to unpack (expected 2)

i am trying make pipeline with scaler, onhotencoder, polynomialfeature, and finally linear regression model
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler', StandardScaler(), num_cols),
('polynom', PolynomialFeatures(3), num_cols),
('encoder', OneHotEncoder(), cat_cols),
('linear_regression', LinearRegression() )
])
but when i fit the pipeline i have ValueError: too many values to unpack (expected 2)
pipeline.fit(x_train,y_train)
pipeline.score(x_test, y_test)
If I understand correctly, you want to apply some steps of the pipeline to specific columns. Instead of doing it by adding the column names ad the end of the pipeline stage (which is incorrect and causes the error), you have to use a ColumnTransformer. Here you can find another similar example.
In your case, you could do something like this:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
# Fake data.
train_data = pd.DataFrame({'n1': range(10), 'n2': range(10)})
train_data['c1'] = 0
train_data['c1'][5:] = 1
y_train = [0]*10
y_train[5:] = [1]*5
# Here I assumed you are using a DataFrame. If not, use integer indices instead of column names.
num_cols = ['n1', 'n2']
cat_cols = ['c1']
# Pipeline to transform the numerical features.
numerical_transformer = Pipeline([('scaler', StandardScaler()),
('polynom', PolynomialFeatures(3))
])
# Apply the numerical transformer only on the numerical columns.
# Spearately, apply the OneHotEncoder.
ct = ColumnTransformer([('num_transformer', numerical_transformer, num_cols),
('encoder', OneHotEncoder(), cat_cols)])
# Main pipeline for fitting.
pipeline = Pipeline([
('column_transformer', ct),
('linear_regression', LinearRegression() )
])
pipeline.fit(train_data, y_train)
Schematically, the layout of your pipeline would be like this:

OneHotEncoder ValueError: Input contains NaN

I have downloaded this data, and this is my code:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils.multiclass import unique_labels
import plotly.figure_factory as ff
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import make_column_transformer
random_state = 27912
df_train = pd.read_csv("...")
df_test = pd.read_csv("...")
X_train, X_test, y_train, y_test = train_test_split(df_train.drop(["Survived", "Ticket", "Cabin", "Name", "PassengerId"],
axis = 1),
df_train["Survived"], test_size=0.2,
random_state=42)
numeric_col_names = ["Age", "SibSp", "Parch", "Fare"]
ordinal_col_names = ["Pclass"]
one_hot_col_names = ["Embarked", "Sex"]
ct = make_column_transformer(
(SimpleImputer(strategy="median"), numeric_col_names),
(SimpleImputer(strategy="most_frequent"), ordinal_col_names + one_hot_col_names),
(OrdinalEncoder(), ordinal_col_names),
(OneHotEncoder(), one_hot_col_names),
(StandardScaler(), ordinal_col_names + one_hot_col_names + numeric_col_names))
preprocessing_pipeline = Pipeline([("transformers", ct)])
preprocessing_pipeline.fit_transform(X_train)
I'm trying make column_transformer for preprocessing step, however, the OneHotEncoding step is giving me an error, ValueError: Input contains NaN. I don't really know why this is happening, because I'm imputing the values before. Any clues on why this is happening?
Trying something like this doesn't help neither
preprocessing_pipeline = Pipeline([("transformers", ct_first)])
ct_second = make_column_transformer((OneHotEncoder(), one_hot_col_names),(StandardScaler(), ordinal_col_names + one_hot_col_names + numeric_col_names))
pipeline = Pipeline([("transformer1", preprocessing_pipeline), ("transformer2", ct_second)])
pipeline.fit_transform(X_train)
I would like to know why is this happening and why the above code, first and second tries, are not correct.
Thanks
You need to create a pipeline for each column type to make sure that the different steps are applied sequentially (i.e. to make sure that the missing values are imputed prior to encoding and scaling), see also this example in the scikit-learn documentation.
import pandas as pd
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
# Load the data (from https://www.kaggle.com/c/titanic/data)
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
# Extract the features
X_train = df_train.drop(labels=['Survived', 'Ticket', 'Cabin', 'Name', 'PassengerId'], axis=1)
X_test = df_test.drop(labels=['Ticket', 'Cabin', 'Name', 'PassengerId'], axis=1)
# Map the feature names to the corresponding
# types (numerical, ordinal or categorical)
numeric_col_names = ['Age', 'SibSp', 'Parch', 'Fare']
ordinal_col_names = ['Pclass']
one_hot_col_names = ['Embarked', 'Sex']
# Define the numerical features pipeline
numeric_col_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Define the ordinal features pipeline
ordinal_col_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OrdinalEncoder()),
('scaler', StandardScaler())
])
# Define the categorical features pipeline
one_hot_col_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(sparse=False)),
('scaler', StandardScaler())
])
# Create the overall preprocessing pipeline
preprocessing_pipeline = make_column_transformer(
(numeric_col_transformer, numeric_col_names),
(ordinal_col_transformer, ordinal_col_names),
(one_hot_col_transformer, one_hot_col_names),
)
# Fit the pipeline to the training data
preprocessing_pipeline.fit(X_train)
# Apply the pipeline to the training and test data
X_train_ = preprocessing_pipeline.transform(X_train)
X_test_ = preprocessing_pipeline.transform(X_test)

How to solve 'Input contains NaN, infinity or a value too large for dtype('float64')' after already preprocessing using Pipeline?

There are many posts containing this error, but I couldn't find the solution for this problem. I'm using this dataset. This is what I've done, a preprocessing, with SimpleImputer for categorical and numerical features:
import pandas as pd
import numpy as np
%load_ext nb_black
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import CatBoostEncoder
from sklearn.model_selection import train_test_split
housing = pd.read_csv("housing.csv")
housing.head()
X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
numeric_transformer = Pipeline(
steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="constant")),
("encoder", CatBoostEncoder()),
]
)
numeric_features = [
"housing_median_age",
"total_rooms",
"total_bedrooms",
"population",
"households",
"median_income",
]
categorical_features = ["ocean_proximity"]
preprocessor = ColumnTransformer(
transformers=[
("numeric", numeric_transformer, numeric_features),
("categorical", categorical_transformer, categorical_features),
]
)
from sklearn.linear_model import LinearRegression
pipeline = Pipeline(
steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)
lr_model = pipeline.fit(X_train, y_train)
But I got this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Any idea of what's happening in here?
It seems that the CatBoostEncoder is returning several nan values when fitted to the training set, which is why the LinearRegression throws an error.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from category_encoders import CatBoostEncoder
housing = pd.read_csv("housing.csv")
X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
numeric_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler())
])
categorical_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy="constant")),
("encoder", CatBoostEncoder())
])
numeric_features = ["housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income"]
categorical_features = ["ocean_proximity"]
preprocessor = ColumnTransformer(transformers=[
("numeric", numeric_transformer, numeric_features),
("categorical", categorical_transformer, categorical_features),
])
X_new = preprocessor.fit_transform(X_train, y_train)
print(np.isnan(X_new).sum(axis=0))
# array([ 0, 0, 0, 0, 0, 0, 4315])

Applying SimpleImputer and OneHotEncoder to multiple columns at once

I am applying the following code to impute and then encode categorical data in my dataset:
# Encoding categorical data
# Define a Pipeline with an imputing step using SimpleImputer prior to the OneHot encoding
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# use strategy='constant', fill_value='missing' for imputing to preserve the categories' structure
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('encoder', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
transformers=[
('cat', categorical_transformer, [0])
])
Z = np.array(preprocessor.fit_transform(Z))
print (Z[:,0])
I want to repeat these steps for all columns in the array Z, as Z is comprised of all categorical features from my original dataset.
Is there a more efficient way of doing this rather than listing each column as such:
preprocessor = ColumnTransformer(
transformers=[
('cat', categorical_transformer, [0,1,2,3,4,5,6,7,8,9,10])
])
Thanks in advance!
If all columns have the same type, I would simply omit the ColumnTransformer and use a simple pipeline in your case:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
# some sample data
X = pd.DataFrame({
'col1': ['obj1', 'obj2', 'obj3'],
'col2': [np.nan, 'oj3', 'oj1'],
'col3': ['jo3', 'jo1', np.nan]
}).astype('category')
y = pd.Series([0, 1, 1])
pipeline = make_pipeline(
SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='missing'),
OneHotEncoder(handle_unknown='ignore', sparse=False)
)
Z = pipeline.fit_transform(X, y)
The ColumnTransformer is meant to be used for heterogeneous data when columns or column subsets of the input need to be transformed separately (read here). However, since your features are all of the same type and all require the same preprocessing procedure, you can just apply SimpleImputer and OneHotEncoder to the whole dataset as these transformers will automatically detect the columns to transform (which in your case are simply all).

Categories