Scikit-learn Imputer Reducing Dimensions - python

I have a dataframe with 332 columns. I want to impute missing values so that I can use scikit-learn's decision tree classifier. My problem is that the data returned by the imputer has only 330 columns.
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
cols = data.columns
new = imp.fit_transform(data)
print(data.shape,new.shape)
(34132, 332) (34132, 330)

According to the documentation of sklearn.preprocessing.Imputer:
When axis=0, columns which only contained missing values at fit are discarded upon transform.
So, this is removing all-missing-value columns.
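To see which columns are being dropped, you can look for columns that are entirely missing before imputing. The following is a minimal sketch on made-up toy data (the column names a, b, c are hypothetical), assuming a recent scikit-learn where SimpleImputer from sklearn.impute replaces the older Imputer class and, by default, also discards all-missing columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame: column "b" contains only missing values
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [4.0, 5.0, np.nan],
})

# Columns that are entirely NaN are the ones the imputer discards
all_missing = data.columns[data.isna().all()].tolist()
kept = [c for c in data.columns if c not in all_missing]

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
new = pd.DataFrame(imp.fit_transform(data), columns=kept, index=data.index)

print(all_missing)            # names of the dropped columns
print(data.shape, new.shape)  # the column count shrinks by len(all_missing)
```

Rebuilding the DataFrame with the kept column names makes it easy to check which features survived the imputation step.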

Related

How to setup the Imputer as part of sklearn pipeline?

I am working on the Titanic dataset and I wish to handle all the preprocessing activities on a pipeline. So, here is my code:
To get the dataset
!wget "https://calmcode.io/datasets/titanic.csv"
And then read it as below:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
dt = pd.read_csv("./data/titanic.csv", index_col=["PassengerId"])
And then I set up a single pipeline which is supposed to preprocess the numerical features:
numerical_features = ["Age", "SibSp", "Parch", "Fare"]
numerical_pipeline = Pipeline(steps=[("min_max_scaler", MinMaxScaler()),
('num_imputer', SimpleImputer(missing_values=np.nan, strategy='mean'))])
Then fit the pipeline:
column_transformer = ColumnTransformer(transformers=[
('numeric_transformer', numerical_pipeline, numerical_features)], remainder='drop')
column_transformer.fit(dt)
transformed_dt = column_transformer.transform(dt)
But I need to apply the imputer only to the Age feature and not to the other columns. Currently, it is applied to all of the numerical columns.
My question is:
How can I specify that the SimpleImputer should be applied only to the Age column and not to everything in numerical_pipeline?
I think you need to use two column transformers, so if you set up the minmax this way:
minmax = ColumnTransformer([(
"minmax",
MinMaxScaler(),
["age", "sibsp", "parch", "fare"])
],remainder='drop')
The output comes without column names, but based on the column names we input, age will be the first, so:
imp = ColumnTransformer([(
"impute",
SimpleImputer(missing_values=np.nan, strategy='mean'),
[0])
],remainder='passthrough')
Then into a pipeline:
Pipeline([("scale",minmax),("impute",imp)]).fit_transform(dt)
As you have said in a comment, you want to impute first and scale second. In that case, I would first create a column transformer that imputes only the one column, passes through the three other numerical columns, and drops the columns that are not part of that set. After that, you add a MinMaxScaler on the output of that column transformer. In code:
ct = ColumnTransformer(
[
("num_imputer", SimpleImputer(missing_values=np.nan, strategy="mean"), ["Age"]),
("needed_columns", "passthrough", ["SibSp", "Parch", "Fare"]),
],
)
pipeline = Pipeline(steps=[("transform", ct), ("scale", MinMaxScaler())])
The important bit here is that you add a second entry to the list of transformers, that has the word "passthrough" and specifies all the columns that should be passed through without any modifications.
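To check that this behaves as described, here is a small self-contained sketch that runs the same pipeline on a made-up toy frame (the values and the extra Name column are assumptions for illustration, not the real Titanic data):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the Titanic frame; only Age has a missing value
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "SibSp": [0, 1, 2],
    "Parch": [0, 0, 2],
    "Fare": [10.0, 20.0, 30.0],
    "Name": ["a", "b", "c"],  # not listed below, so it is dropped
})

ct = ColumnTransformer(
    [
        # impute Age only
        ("num_imputer", SimpleImputer(missing_values=np.nan, strategy="mean"), ["Age"]),
        # keep these columns unchanged
        ("needed_columns", "passthrough", ["SibSp", "Parch", "Fare"]),
    ],
)  # remainder defaults to "drop"
pipeline = Pipeline(steps=[("transform", ct), ("scale", MinMaxScaler())])

out = pipeline.fit_transform(toy)
print(out.shape)  # 4 columns: Age, SibSp, Parch, Fare
print(out)
```

The output columns follow the order of the transformers list, so Age comes first, followed by the passed-through columns.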

Python's "StandardScaler" and "LabelEncoder", and "fit" and "fit_transform" do not work with a CSV which contains both float and string

I was learning the MLP regressor in Google Colaboratory and ran the source code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = np.array(table)
scaler.fit(data)
y_index = data.shape[1]-1
sd_x = (scaler.var_[:y_index])**0.5
sd_y = (scaler.var_[y_index])**0.5
mean_x = scaler.mean_[:y_index]
mean_y = scaler.mean_[y_index]
x = (data[:, :y_index]).astype(np.float32)
y = (data[:, y_index]).astype(np.float32)
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25)
print('Separate training and testing sets!')
It gave the error ValueError: could not convert string to float: 'Photo Editor & Candy Camera & Grid & ScrapBook'.
So I checked the question RandomForestClassfier.fit(): ValueError: could not convert string to float. I also tried sklearn-LinearRegression: could not convert string to float: '--'.
I changed from fit(data) to fit_transform(data), but the same error persisted. Then I changed from StandardScaler to LabelEncoder, that is, from scaler = StandardScaler() to scaler = LabelEncoder(). But a different error appeared: ValueError: bad input shape (10841, 13) on the line scaler.fit_transform(data).
You can check the CSV on Kaggle here. The CSV contains both strings and numbers without quotation marks (except the prices, which are enclosed in double quotation marks).
From the documentation of sklearn's LabelEncoder: This transformer should be used to encode target values, i.e. y, and not the input X.
Particularly, it's not intended to fit a LabelEncoder on the full dataset.
If you just want to replace the values of the categorical (i.e., string-valued) columns with unique numeric ids, one way to go is to apply the label encoder (before splitting the data) individually to each column you want to encode. As your sample code imports pandas, I assume that your data has been loaded into a pandas.DataFrame like
df = pd.read_csv('/path/to/googleplaystore.csv')
From there, you can apply the encoder on each column:
from sklearn.preprocessing import LabelEncoder
df['App'] = LabelEncoder().fit_transform(df['App'].values)
You may also want to have a look at how to handle categorical data within pandas.
However, even after doing this for each non-numeric column in your dataset, there is still a long way before fitting a model on the encoded data (you may want to apply one-hot encoding onto these columns afterwards, but this heavily depends on the model you want to use).
StandardScaler is a preprocessing class from sklearn that takes numeric entries and standardizes them to zero mean and unit variance (it rescales the data but does not change the shape of its distribution). It doesn't deal with text data, which explains the first error.
LabelEncoder is another preprocessing class from sklearn that takes data and maps them to a numeric encoded representation.
Ex: ["apple","banana","apple","banana"] to [0,1,0,1]
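As a minimal sketch of that mapping: LabelEncoder works on a single 1-D sequence of labels, which is why passing the whole 2-D table produced the "bad input shape" error.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["apple", "banana", "apple", "banana"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)  # 1-D input, one integer id per label

print(codes)                         # integer ids, assigned in sorted label order
print(enc.classes_)                  # the original labels, indexed by id
print(enc.inverse_transform(codes))  # back to the original strings
```

To encode several columns this way, each column needs its own encoder call, as in the per-column example above.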
Your dataset has missing values, so you should deal with them first, by imputing, dropping, or a similar approach.
Then you should convert each column to an appropriate type (all columns except Rating are currently read as object, i.e. string) so that each datatype is handled properly.
table = pd.read_csv('googleplaystore.csv')
# check dataset info
table.info()
# check missing values
table.isna().sum()
To be honest, I think this is more of a conceptual problem than a technical one. As other users told you, StandardScaler must be used on numeric columns, but most of your dataframe columns are of object type. You should probably use OneHotEncoder on them; all transformers in sklearn behave similarly.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit_transform(X) # your data without target column
# ...blabla...
Finally, I recommend you read about Pipelines from sklearn, I think they are more elegant than a lot of messy code. You can put preprocessing and model steps on the same pipeline, for example here.
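As a rough sketch of what such a pipeline could look like for a table that mixes strings and numbers, the example below scales the numeric column and one-hot encodes the string column in one step. The toy values and the choice of columns (Category, Reviews, Rating) are assumptions for illustration and do not reproduce the real Kaggle file:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one string column, one numeric column and a numeric target
toy = pd.DataFrame({
    "Category": ["ART", "GAME", "ART", "TOOLS"],
    "Reviews": [100.0, 2500.0, np.nan, 40.0],
    "Rating": [4.1, 3.9, 4.7, 4.4],
})
X = toy[["Category", "Reviews"]]
y = toy["Rating"]

# Numeric columns: fill missing values, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column to the transformer that can handle its type
preprocess = ColumnTransformer([
    ("num", numeric, ["Reviews"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"]),
])

X_encoded = preprocess.fit_transform(X)
print(X_encoded.shape)  # 1 scaled column + 1 column per category
```

A regressor could then be appended as a final step of a Pipeline, so that the preprocessing is fitted only on the training data.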

SettingWithCopy Warning when Standardizing Only Numeric Columns in Pandas DataFrame with Sklearn [duplicate]

I am getting a SettingWithCopyWarning from Pandas when performing the below operation. I understand what the warning means and I know I can turn the warning off but I am curious if I am performing this type of standardization incorrectly using a pandas dataframe (I have mixed data with categorical and numeric columns). My numbers seem fine after checking but I would like to clean up my syntax to make sure I am using Pandas correctly.
I am curious if there is a better workflow for this type of operation when dealing with data sets that have mixed data types like this.
My process is as follows with some toy data:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from typing import List
# toy data with categorical and numeric data
df: pd.DataFrame = pd.DataFrame([['0',100,'A', 10],
['1',125,'A',15],
['2',134,'A',20],
['3',112,'A',25],
['4',107,'B',35],
['5',68,'B',50],
['6',321,'B',10],
['7',26,'B',27],
['8',115,'C',64],
['9',100,'C',72],
['10',74,'C',18],
['11',63,'C',18]], columns = ['id', 'weight','type','age'])
df.dtypes
id object
weight int64
type object
age int64
dtype: object
# select categorical data for later operations
cat_cols: List = df.select_dtypes(include=['object']).columns.values.tolist()
# select numeric columns for later operations
numeric_cols: List = df.columns[df.dtypes.apply(lambda x: np.issubdtype(x, np.number))].values.tolist()
# prepare data for modeling by splitting into train and test
# use standardization means/standard deviations from the TRAINING SET only
# and apply them to the testing set so as to avoid information leakage from the testing set into training
X: pd.DataFrame = df.copy()
y: pd.Series = df.pop('type')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# perform standardization of numeric variables using the mean and standard deviations of the training set only
X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train_numeric_tmp)
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])
<ipython-input-15-74f3f6c70f6a>:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Your X_train and X_test are still slices of the original dataframe. Modifying a slice triggers the warning, and the modification often doesn't take effect.
You can either transform before train_test_split, or do X_train = X_train.copy() after the split and then transform.
The second approach prevents the information leak mentioned in the comments of your code. So something like this:
# these 2 lines don't look good to me
# X: pd.DataFrame = df.copy() # don't you drop the label?
# y: pd.Series = df.pop('type') # y = df['type']
# pass them directly instead
features = [c for c in df if c!='type']
X_train, X_test, y_train, y_test = train_test_split(df[features], df['type'],
test_size = 0.2,
random_state = 0)
# now copy what we want to transform
X_train = X_train.copy()
X_test = X_test.copy()
## Code below should work without warning
############
# perform standardization of numeric variables using the mean and standard deviations of the training set only
# you don't need copy the data to fit
# X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])
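Put together as one self-contained script, the approach could look like the sketch below. It reuses the toy data from the question and hard-codes numeric_cols for brevity (an assumption; the question derives it from the dtypes):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data from the question
df = pd.DataFrame({
    "id": [str(i) for i in range(12)],
    "weight": [100, 125, 134, 112, 107, 68, 321, 26, 115, 100, 74, 63],
    "type": list("AAAABBBBCCCC"),
    "age": [10, 15, 20, 25, 35, 50, 10, 27, 64, 72, 18, 18],
})
numeric_cols = ["weight", "age"]
features = [c for c in df.columns if c != "type"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["type"], test_size=0.2, random_state=0
)

# Work on explicit copies so the assignments below do not touch a slice of df
X_train = X_train.copy()
X_test = X_test.copy()

# Fit on the training set only, then apply the same statistics to both sets
scaler = StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train[numeric_cols].mean())  # close to 0 for each column
```

Only the training columns end up with mean 0; the test columns are shifted by the training statistics, which is the intended behaviour.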
I'll explain both pd.get_dummies() and OneHotEncoder() for transforming categorical data into dummy columns. I recommend the OneHotEncoder() transformer, because it is a sklearn transformer that you can later use in a Pipeline if you want.
First, OneHotEncoder(): it does the same job as pandas' pd.get_dummies function, but it returns a NumPy ndarray or a sparse array. You can read more about this class here:
from sklearn.preprocessing import OneHotEncoder
X_train_cat = X_train[["type"]]
cat_encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
X_train_cat_1hot = cat_encoder.fit_transform(X_train_cat) #This is a numpy ndarray!
#If you want to make a DataFrame again, you can do so like below:
#X_train_cat_1hot = pd.DataFrame(X_train_cat_1hot, columns=cat_encoder.categories_[0])
#You can also concatenate this transformed dataframe with your numerical transformed one.
Second method, pd.get_dummies():
df_dummies = pd.get_dummies(X_train[["type"]])
X_train = pd.concat([X_train, df_dummies], axis=1).drop("type", axis=1)

Using Keras StandardScaler with Groupby function in Pandas

I have a pandas data frame with multiple columns. I need to use groupby function on each column and after use Keras StandardScaler function to transform each column in the dataframe. I tried the following code:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
df2= df.groupby('Sector').apply(lambda x: scaler.fit_transform(x.astype(float)))
But it returns lists of data by group, whereas I need to preserve the initial structure of the dataframe.
I specifically need to use StandardScaler because afterwards I want to use it to transform testing features.
Is there a way to use StandardScaler in this case?
I don't understand why you are using group by.
To get a copy of data
scaled_features = data.copy()
To scale only a few columns
features = scaled_features[['column1','column2']]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
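You can also transform a single column directly. Alternatively, you can filter with a condition so that only the rows matching it are scaled.
# scale a single column; StandardScaler expects 2-D input, hence the double brackets
df_c = scaler.fit_transform(df[['column']].astype(float))
# scale 'column1' and 'column2' only for the rows where 'column' matches the condition
df_cc = scaler.fit_transform(df.loc[df['column'] == '...', ['column1', 'column2']].astype(float))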
Another way is to use ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
ct = ColumnTransformer([
('name', StandardScaler(), ['column1', 'column2'])
], remainder='passthrough')
ct.fit_transform(scaled_features)  # pass the DataFrame, since the columns are selected by name
I don't think StandardScaler has such an option, but the following pandas commands may help.
# Normalize around mean
df['normal'] = df.groupby('Group').transform(lambda x: (x - x.mean()) / x.std())
# Normalize between 0 and 1
df['normal'] = df.groupby('Group').transform(lambda x: ((x - x.min())/ (x.max() - x.min())))
Regarding testing data, I think you should do the normalization before and then split it into train and test.
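If the per-group statistics have to be reused on testing features, one possible alternative is to store the group means and standard deviations from the training data and map them onto any other frame. The sketch below uses made-up data with a single feature column x (a hypothetical name) and ddof=0 to match what StandardScaler computes:

```python
import pandas as pd

# Toy training and testing frames
train = pd.DataFrame({"Sector": ["A", "A", "B", "B"], "x": [1.0, 3.0, 10.0, 30.0]})
test = pd.DataFrame({"Sector": ["A", "B"], "x": [2.0, 20.0]})

# Per-sector statistics learned from the training data only
grouped = train.groupby("Sector")["x"]
means = grouped.mean()
stds = grouped.std(ddof=0)  # population std, as used by StandardScaler

def scale_by_sector(frame):
    """Standardize x with the training statistics of each row's sector."""
    mu = frame["Sector"].map(means)
    sigma = frame["Sector"].map(stds)
    return (frame["x"] - mu) / sigma

# The result is a Series aligned with the frame, so the structure is preserved
train["x_scaled"] = scale_by_sector(train)
test["x_scaled"] = scale_by_sector(test)
print(train)
print(test)
```

A sector that appears only in the testing data would map to NaN here, so that case needs separate handling.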

How to use OneHotEncoder and Pipeline to make new predictions?

I'm working through a tutorial focusing on OneHotEncoder. I get the idea behind encoding features, but I'm having a little problem with using the encoder with pipeline to make a new prediction. Two of the features--"Sex" and "Embarked"--are categorical rather than numerical. When creating a new numpy array to make a prediction, do you use the initial values, say "male" and "C", or, say, "1" and "2" to make a new prediction? I get the following error: " ValueError: Specifying the columns using strings is only supported for pandas DataFrames," which is weird given that the values I'm using are numerical. Regardless, would I have to fit the pipeline to X_new to make a new prediction? If so, how can I do that?
X_new = [[3, 1, 0]] OR X_new = [['3','male', 'C']]
pipe.predict(X_new)
Complete code:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/---/pandas-videos/master/data/titanic_train.csv")
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs')
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=5, scoring='accuracy').mean()
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
X = df.drop('Survived', axis='columns')
from sklearn.compose import make_column_transformer
column_trans = make_column_transformer(
(OneHotEncoder(), ['Sex', 'Embarked']),
remainder='passthrough')
column_trans.fit_transform(X)
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(column_trans, logreg)
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
X_new = [[3, 1, 0]]
pipe.predict(X_new)
When you apply OneHotEncoder, each categorical column that you specify is transformed into multiple binary columns, one per unique value in that column.
For example, if the gender column contains 'male' and 'female', the original column is converted into 2 columns, one for 'male' and one for 'female'. This is different from LabelEncoder.
If you want to combine a pipeline, logistic regression and OneHotEncoder, first fit the pipeline on the training data.
pipe.fit(X,y)
and then you can make predictions. Here is an example in which the pipeline was fitted on three features (Sex, Age and Embarked), with one-hot encoding applied to Sex and Embarked.
X_new = [['female', 20, 'C']]
X_new_df = pd.DataFrame (X_new,columns=['Sex','Age','Embarked'])
pipe.predict(X_new_df)
However, your code uses every feature except the label ('Survived'), which is 11 features. The input passed to predict must contain the columns the pipeline was fitted on, whereas you supply only 3 values, which may raise an error.
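The whole flow can be sketched end to end on a tiny made-up frame (the rows below are invented and the feature set Pclass, Sex, Embarked is an assumption). The new sample is passed with the raw string values in a DataFrame whose column names match the training data, and the pipeline is fitted once on the training data, not on X_new:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny invented training set
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "Q", "C"],
    "Survived": [0, 1, 1, 1, 0, 0],
})
X = train[["Pclass", "Sex", "Embarked"]]
y = train["Survived"]

column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
    remainder="passthrough",
)
pipe = make_pipeline(column_trans, LogisticRegression(solver="lbfgs"))
pipe.fit(X, y)  # the encoder learns its categories here

# New data: raw values, same column names as the training features
X_new = pd.DataFrame([[3, "male", "C"]], columns=["Pclass", "Sex", "Embarked"])
pred = pipe.predict(X_new)
print(pred)
```

Because the column transformer selects columns by name, a plain list or NumPy array triggers the "Specifying the columns using strings" error, while a DataFrame works.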
