Using sklearn StandardScaler with the Pandas groupby function - python

I have a pandas data frame with multiple columns. I need to group the data frame by one column and then use sklearn's StandardScaler to transform the remaining columns within each group. I tried the following code:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
df2= df.groupby('Sector').apply(lambda x: scaler.fit_transform(x.astype(float)))
but it returns an array of scaled values per group, whereas I need to preserve the initial structure of the dataframe.
I specifically need to use StandardScaler because afterwards I want to use it to transform testing features.
Is there a way to use StandardScaler in this case?

I don't understand why you are using group by.
To get a copy of the data:
scaled_features = data.copy()
To scale only a few columns:
features = scaled_features[['column1','column2']]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
You can also transform a single column, or use a boolean filter to scale only the rows where a value matches your query. Note that StandardScaler expects 2D input, so select the column with double brackets:
df_c = scaler.fit_transform(df[['column']].astype(float))
df_cc = scaler.fit_transform(df[df['column'] == '...'][['column']].astype(float))
Another way is to use ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
ct = ColumnTransformer([
    ('name', StandardScaler(), ['column1', 'column2'])
], remainder='passthrough')
ct.fit_transform(features)

I don't think StandardScaler itself supports this, but the following pandas commands can help.
# Normalize around mean
df['normal'] = df.groupby('Group').transform(lambda x: (x - x.mean()) / x.std())
# Normalize between 0 and 1
df['normal'] = df.groupby('Group').transform(lambda x: ((x - x.min())/ (x.max() - x.min())))
Regarding the testing data, I think you should do the normalization first and then split the data into train and test.
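None of the above keeps the fitted scalers around for the test set, which the question specifically asks for. Below is a minimal sketch of one way to do that, assuming df is the training frame from the question and df_test is a hypothetical test frame with the same columns:
import pandas as pd
from sklearn.preprocessing import StandardScaler
scalers = {}
def scale_group(g):
    # g.name is this group's 'Sector' value; keep the fitted scaler for test time
    scaler = StandardScaler()
    scalers[g.name] = scaler
    cols = g.columns.drop('Sector')  # scale everything except the group key
    out = g.copy()
    out[cols] = scaler.fit_transform(g[cols].astype(float))
    return out
df2 = df.groupby('Sector', group_keys=False).apply(scale_group)
# Later, reuse the per-sector scalers fitted on the training data:
def transform_group(g):
    cols = g.columns.drop('Sector')
    out = g.copy()
    out[cols] = scalers[g.name].transform(g[cols].astype(float))
    return out
df_test_scaled = df_test.groupby('Sector', group_keys=False).apply(transform_group)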

Related

How to set up the Imputer as part of an sklearn pipeline?

I am working on the Titanic dataset and I wish to handle all the preprocessing activities in a pipeline. So, here is my code:
To get the dataset
!wget "https://calmcode.io/datasets/titanic.csv"
And then read it as below:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
dt = pd.read_csv("./titanic.csv", index_col=["PassengerId"])
And then I set up a single pipeline which is supposed to preprocess the numerical features:
numerical_features = ["Age", "SibSp", "Parch", "Fare"]
numerical_pipeline = Pipeline(steps=[("min_max_scaler", MinMaxScaler()),
                                     ("num_imputer", SimpleImputer(missing_values=np.nan, strategy='mean'))])
Then fit the pipeline:
column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', numerical_pipeline, numerical_features)],
    remainder='drop')
column_transformer.fit(dt)
transformed_dt = column_transformer.transform(dt)
But I need to apply the imputer only to the Age feature and not to all the other columns. Currently, it applies the imputer over all the columns.
My question is:
How can I specify that I need to apply the SimpleImputer only to the Age column and not to all the columns in numerical_pipeline?
I think you need to use two column transformers, so if you set up the minmax this way:
minmax = ColumnTransformer([(
    "minmax",
    MinMaxScaler(),
    ["age", "sibsp", "parch", "fare"])
], remainder='drop')
The output comes without column names, but based on the column names we input, age will be the first column, so:
imp = ColumnTransformer([(
    "impute",
    SimpleImputer(missing_values=np.nan, strategy='mean'),
    [0])
], remainder='passthrough')
Then into a pipeline:
Pipeline([("scale",minmax),("impute",imp)]).fit_transform(dt)
As you have said in a comment, you want to impute first and scale second. In that case, I would first create a column transformer that only imputes the one column, passes through the other three numerical columns, and drops the columns that are not part of that set. After that, you add a MinMaxScaler on the output of that column transformer. In code:
ct = ColumnTransformer(
    [
        ("num_imputer", SimpleImputer(missing_values=np.nan, strategy="mean"), ["Age"]),
        ("needed_columns", "passthrough", ["SibSp", "Parch", "Fare"]),
    ],
)
pipeline = Pipeline(steps=[("transform", ct), ("scale", MinMaxScaler())])
The important bit here is the second entry in the list of transformers: it uses the keyword "passthrough" and specifies all the columns that should be passed through without any modification.
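Fitting then works the same way as before; a minimal usage sketch, assuming dt is the frame loaded above:
transformed_dt = pipeline.fit_transform(dt)  # columns follow the order declared in ct: Age, SibSp, Parch, Fare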

Accessing column names of a pandas dataframe within a custom transformer in a Sklearn pipeline with ColumnTransformer?

I need to use a custom transformer within a pipeline that acts using the column names. However, the previous pipeline transformations convert the dataframe to a numpy array. I know I can retrieve the column names from the Column Transformer object after the pipeline has been fit, but I need to access the column names within the fit step. The custom transformer in my example below is a simple minimal example for illustration only, not the true transformation.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin
class MyCustomTransformer(BaseEstimator, TransformerMixin):
    def my_custom_transformation(self, X):
        """
        Parameters
        ----------
        X: pandas dataframe
        """
        columns_to_keep = [col for col in X.columns if col.endswith(('_a', '_b'))]
        return columns_to_keep

    def fit(self, X, y=None):
        self.columns_to_keep = self.my_custom_transformation(X)
        return self

    def transform(self, X, y=None):
        return X[self.columns_to_keep]
numeric_transformer = Pipeline(steps=[('minmax_scaler', MinMaxScaler())])
categorical_transformer = Pipeline(steps=[('onehot_encoder', OneHotEncoder(sparse=False))])
column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', numeric_transformer, ['num']),
    ('categorical_transformer', categorical_transformer, ['cat']),
])
pipeline = Pipeline(steps=[
    ('column_transformer', column_transformer),
    ('my_custom_transformer', MyCustomTransformer())
])
df = pd.DataFrame(data={'num': [1,2,3], 'cat':['a', 'b', 'c']})
pipeline.fit(df)
which would ideally result in:
transformed_df = pipeline.transform(df)
print(transformed_df)
>>>   num  cat_a  cat_b
0     0      1      0
1     0.5    0      1
2     1      0      0
The transformations in the column_transformer convert the dataframe to a numpy array, which is then passed to the custom transformer. Obviously this results in an error since you can't get the column names from a numpy array.
I can't use indexing to access the columns, since the one-hot encoding can result in a not-previously-known number of columns.
If I could access the ColumnTransformer object within the fit method of the custom transformer, I could retrieve the column names, then create a pandas dataframe to use in the fit method as above (?), but I have not successfully found a way to do this.
Any help would be much appreciated.
See my proposed implementation of a ColumnTransformerWithNames in response to "How do I get_feature_names using a column transformer?".
You can replace the calls to ColumnTransformer with ColumnTransformerWithNames and the output of the pipeline will be a DataFrame with column names =)
pip install sklearn-pandas-transformers
from sklearn_pandas_transformers.transformers import SklearnPandasWrapper
column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', SklearnPandasWrapper(numeric_transformer), ['num']),
    ('categorical_transformer', SklearnPandasWrapper(categorical_transformer), ['cat']),
])

How to use OneHotEncoder and Pipeline to make new predictions?

I'm working through a tutorial focusing on OneHotEncoder. I get the idea behind encoding features, but I'm having a little problem with using the encoder in a pipeline to make a new prediction. Two of the features, "Sex" and "Embarked", are categorical rather than numerical. When creating a new numpy array to make a prediction, do you use the initial values, say "male" and "C", or, say, "1" and "2"? I get the following error: "ValueError: Specifying the columns using strings is only supported for pandas DataFrames", which is weird given that the values I'm using are numerical. Regardless, would I have to fit the pipeline to X_new to make a new prediction? If so, how can I do that?
X_new = [[3, 1, 0]] OR X_new = [['3','male', 'C']]
pipe.predict(X_new)
Complete code:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/---/pandas-videos/master/data/titanic_train.csv")
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs')
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=5, scoring='accuracy').mean()
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
X = df.drop('Survived', axis='columns')
from sklearn.compose import make_column_transformer
column_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough')
column_trans.fit_transform(X)
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(column_trans, logreg)
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
X_new = [[3, 1, 0]]
pipe.predict(X_new)
When you apply OneHotEncoder, the categorical column that you specify will be transformed into multiple integer columns, one per unique value in the categorical column.
For example, if the gender column contains 'male' and 'female', the original column is converted into two columns, 'male' and 'female'. This is different from LabelEncoder.
If you want to apply the pipeline with logistic regression and OneHotEncoder, fit the pipeline on the training data first:
pipe.fit(X,y)
and then you can apply the prediction. Here is an example where I use three features (Sex, Age, and Embarked) and apply OHE to Sex and Embarked.
X_new = [['female', 20, 'C']]
X_new_df = pd.DataFrame(X_new, columns=['Sex', 'Age', 'Embarked'])
pipe.predict(X_new_df)
However, your code uses all features except the label class ('Survived'), i.e. 11 features. The number of input columns at prediction time must match what the model was fitted on, so passing only 3 columns will raise an error.
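For illustration, a sketch of a full-width prediction row (standard Titanic column names assumed, values made up; note that the passthrough string columns Name, Ticket and Cabin would also need encoding or dropping before LogisticRegression can actually fit):
X_new_df = pd.DataFrame([{
    'PassengerId': 892, 'Pclass': 3, 'Name': 'Doe, Mr. John', 'Sex': 'male',
    'Age': 20.0, 'SibSp': 0, 'Parch': 0, 'Ticket': 'A/5 21171',
    'Fare': 7.25, 'Cabin': None, 'Embarked': 'C',
}])
pipe.predict(X_new_df)  # the column set must match what pipe.fit saw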

How to leave numerical columns out when using sklearn OneHotEncoder?

Environment:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
Sample data:
X_train = pd.DataFrame({'A': ['a1', 'a3', 'a2'],
                        'B': ['b2', 'b1', 'b3'],
                        'C': [1, 2, 3]})
y_train = pd.DataFrame({'Y': [1, 0, 1]})
Desired outcome:
I would like to include sklearn OneHotEncoder in my pipeline in this format:
encoder = ### SOME CODE ###
scaler = StandardScaler()
model = RandomForestClassifier(random_state=0)
# This is my ideal pipeline
pipe = Pipeline([('OneHotEncoder', encoder),
                 ('Scaler', scaler),
                 ('Classifier', model)])
pipe.fit(X_train, y_train)
Challenge:
OneHotEncoder is encoding everything, including the numerical columns. I want to keep the numerical columns as they are and encode only the categorical features, in an efficient way that's compatible with Pipeline().
encoder = OneHotEncoder(drop='first', sparse=False)
encoder.fit(X_train)
encoder.transform(X_train) # Columns C is encoded - this is what I want to avoid
Work around (not ideal): I can get around the problem using pd.get_dummies(). However, this means I can't include it in my pipeline. Or is there a way?
X_train = pd.get_dummies(X_train, drop_first=True)
My preferred solution for this would be to use sklearn's ColumnTransformer (see the documentation).
It enables you to split the data in as many groups as you want (in your case, categorical vs numerical data) and apply different preprocessing operations to these groups. This transformer can then be used in a pipeline as any other sklearn preprocessing tool. Here is a short example:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
X = pd.DataFrame({"a":[1,2,3],"b":["A","A","B"]})
y = np.array([0,1,1])
OHE = OneHotEncoder()
scaler = StandardScaler()
RFC = RandomForestClassifier()
cat_cols = ["b"]
num_cols = ["a"]
transformer = ColumnTransformer([('cat_cols', OHE, cat_cols),
                                 ('num_cols', scaler, num_cols)])
pipe = Pipeline([("preprocessing", transformer),
                 ("classifier", RFC)])
pipe.fit(X, y)
NB: I have taken some license with your request because this only applies the scaler to the numerical data, which I believe makes more sense? If you do want to apply the scaler to all columns, you can do this as well by modifying this example.
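If you do want the scaler applied to every column, the one-hot encoded output included, one variant (a sketch building on the names above, not the only way) is to move the scaler out of the ColumnTransformer and place it after it:
# with_mean=False because OneHotEncoder's default output is sparse,
# and centering a sparse matrix is not supported
transformer = ColumnTransformer([('cat_cols', OHE, cat_cols)],
                                remainder='passthrough')
pipe = Pipeline([("preprocessing", transformer),
                 ("scaler", StandardScaler(with_mean=False)),
                 ("classifier", RFC)])
pipe.fit(X, y)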
What I would do is create my own custom transformer and put it into the pipeline. This way, you have a lot of power over the data in your hands. The steps are as follows:
1) Create a custom transformer class inheriting from BaseEstimator and TransformerMixin. In its transform() function, try to detect whether the values of each column are numerical or categorical. If you do not want to deal with that logic right now, you can always pass the names of the categorical columns to your transformer and select them on the fly.
2) (Optional) Create your custom transformer to handle columns with only categorical values.
3) (Optional) Create your custom transformer to handle columns with only numerical values.
4) Build two pipelines (one for categorical, the other for numerical) using the transformers you created; you can also use existing ones from sklearn.
5) Merge two pipelines with FeatureUnion.
6) Merge your big pipeline with your ML model.
7) Call fit_transform()
The sample code (no optionals implemented): GitHub Jupyter Notebook. A minimal sketch of the same recipe follows below.
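Here is that sketch, using the sample data from the question (the ColumnSelector class and the step mapping are illustrative, not taken from the linked notebook):
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
class ColumnSelector(BaseEstimator, TransformerMixin):
    # Step 1: select a fixed list of columns from a DataFrame
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.columns]
# Steps 4-5: one pipeline per column group, merged with FeatureUnion
cat_pipeline = Pipeline([("select", ColumnSelector(["A", "B"])),
                         ("encode", OneHotEncoder(drop='first', sparse=False))])
num_pipeline = Pipeline([("select", ColumnSelector(["C"])),
                         ("scale", StandardScaler())])
preprocessing = FeatureUnion([("cat", cat_pipeline), ("num", num_pipeline)])
# Steps 6-7: merge with the model and fit
full_pipeline = Pipeline([("preprocessing", preprocessing),
                          ("classifier", RandomForestClassifier(random_state=0))])
full_pipeline.fit(X_train, y_train.values.ravel())  # ravel to get a 1d label array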

Scikit-learn Imputer Reducing Dimensions

I have a dataframe with 332 columns. I want to impute values to be able to use scikit-learn's decision tree classifier. My problem is that the output of the imputer has only 330 columns.
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
cols = data.columns
new = imp.fit_transform(data)
print(data.shape,new.shape)
(34132, 332) (34132, 330)
According to the documentation of sklearn.preprocessing.Imputer:
When axis=0, columns which only contained missing values at fit are discarded upon transform.
So, this is removing all-missing-value columns.
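If you want to verify which columns were dropped, you can list the all-NaN columns before imputing; a small sketch using the data frame from the question:
# columns that contain only missing values are discarded at transform time,
# so these should be the two columns that disappear
all_nan_cols = data.columns[data.isnull().all()]
print(all_nan_cols)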
