I am trying to build a custom scaler that scales only the continuous variables of a dataset (the US Adult Income dataset: https://www.kaggle.com/uciml/adult-census-income), using StandardScaler as a base.
Here is the Python code I used:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        self.scaler = StandardScaler(copy, with_mean, with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
X=new_df_upsampled.copy()
X.drop('income',axis=1,inplace=True)
continuous = df.iloc[:, np.r_[0,2,10:13]]
#basically independent variables that I consider continuous
columns_to_scale = continuous
scaler = CustomScaler(columns_to_scale)
scaler.fit(X)
However, when I tried to run the scaler, I ran into this problem:
So what is the error in the way I built the scaler? And how would you build a custom scaler for this dataset?
Thank you!
There is no need to create a custom transformer for this problem, as the operation can be performed with ColumnTransformer. This transformer allows different columns of the input to be transformed separately.
The example below scales the columns ['A', 'B'] while leaving column C unchanged.
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10),
                   'C': np.arange(10)})

transformer = make_column_transformer(
    (StandardScaler(), ['A', 'B']),
    remainder='passthrough'
)
pd.DataFrame(transformer.fit_transform(df), columns=df.columns)
This outputs the following result:
A B C
0 -1.566699 -1.566699 0.0
1 -1.218544 -1.218544 1.0
2 -0.870388 -0.870388 2.0
3 -0.522233 -0.522233 3.0
4 -0.174078 -0.174078 4.0
5 0.174078 0.174078 5.0
6 0.522233 0.522233 6.0
7 0.870388 0.870388 7.0
8 1.218544 1.218544 8.0
9 1.566699 1.566699 9.0
I agree with @AntoineDubuis that ColumnTransformer is a better (built-in!) way to do this. That said, I'd like to address where your code goes wrong.
In fit, you have self.scaler.fit(X[self.columns], y); this indicates that self.columns should be a list of column names (or a few other options). But you've defined the parameter as continuous = df.iloc[:, np.r_[0,2,10:13]], which is a dataframe.
A couple of other issues:
You should only set attributes in __init__ that come from its signature, or cloning will fail. Create self.scaler in fit instead, and store its parameters (copy, with_mean, with_std) directly in __init__. Don't initialize mean_ or var_ in __init__ either.
You never actually use mean_ or var_. You can keep them if you want, but the relevant statistics are already stored in the fitted scaler object.
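A minimal corrected sketch along those lines (it assumes columns is a list of column names; the Adult-Income column names in the usage comment are only an illustration):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # only store the constructor parameters themselves, so cloning works
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # create and fit the inner scaler at fit time
        self.scaler_ = StandardScaler(copy=self.copy,
                                      with_mean=self.with_mean,
                                      with_std=self.with_std)
        self.scaler_.fit(X[self.columns])
        return self

    def transform(self, X):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler_.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

# usage: pass column *names*, not a dataframe slice
# (hypothetical continuous columns of the Adult dataset)
# columns_to_scale = ['age', 'fnlwgt', 'capital.gain', 'capital.loss', 'hours.per.week']
# scaler = CustomScaler(columns_to_scale)
# scaler.fit(X)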
Related
I have a dataframe that looks like this (it is obviously much bigger):
id points isAvailable frequency Score
abc1 325 True 93.0 0.01
def2 467 False 80.1 0.59
ghi3 122 True 90.3 1
jkl4 546 True 84.0 0
mno5 355 False 93.5 0.99
I want to see how much the features points, isAvailable and frequency influence the Score. I want to use Random Forests:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})
X = df
y = df['Score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
I get the following error: ValueError: could not convert string to float: 'abc1'
Questions:
How can I pre-process the data? What happens to the boolean variables?
Is it wrong to even include the id column in X?
I was thinking of using something like df = df.astype({"a": int, "b": complex}), but I don't really know how to apply it in this case, and I read that there are special encoding algorithms.
First, you have to remove the Score column from the X dataset: it is the label of your data, so it should not be used as a feature.
Second, assuming that the id column is just an identifier for your data, you should remove it from X as well. It is as if you were analyzing a dataset of people's weights: you would drop their names, because there is no correlation between a name and a weight.
Last, to deal with the boolean variable, there are encoding methods like you mentioned (for example this one), but since the value can only be 0 or 1, it is fine to simply convert False to 0 and True to 1.
You can do it with this code (assuming df is the name of your DataFrame):
df['isAvailable'] = (df['isAvailable'] == True).astype(int)
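Putting this together, a minimal sketch (under the assumption that the dataframe above is called df):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# encode the boolean column as 0/1
df['isAvailable'] = df['isAvailable'].astype(int)

# drop the identifier and the label from the feature matrix
X = df.drop(columns=['id', 'Score'])
y = df['Score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

# per-feature influence on Score
print(dict(zip(X.columns, rf.feature_importances_)))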
Based on the guide Implementing PCA in Python by Sebastian Raschka, I am building the PCA algorithm from scratch for my research purposes. The class definition is:
import numpy as np

class PCA(object):
    """Dimension Reduction using Principal Component Analysis (PCA)

    It is the process of computing the principal components which explain the
    maximum variation of the dataset using fewer components.

    :type n_components: int, optional
    :param n_components: Number of components to consider; if not set then
                         `n_components = min(n_samples, n_features)`, where
                         `n_samples` is the number of samples, and
                         `n_features` is the number of features (i.e.,
                         dimension of the dataset).

    Attributes
    ==========
    :type covariance_: np.ndarray
    :param covariance_: Covariance Matrix

    :type eig_vals_: np.ndarray
    :param eig_vals_: Calculated Eigen Values

    :type eig_vecs_: np.ndarray
    :param eig_vecs_: Calculated Eigen Vectors

    :type explained_variance_: np.ndarray
    :param explained_variance_: Explained Variance of Each Principal Component

    :type cum_explained_variance_: np.ndarray
    :param cum_explained_variance_: Cumulative Explained Variance
    """

    def __init__(self, n_components : int = None):
        """Default Constructor for Initialization"""
        self.n_components = n_components

    def fit_transform(self, X : np.ndarray):
        """Fit the PCA algorithm on the dataset and return the projection"""
        if not self.n_components:
            self.n_components = min(X.shape)

        self.covariance_ = np.cov(X.T)

        # calculate eigenvalues and eigenvectors of the covariance matrix
        self.eig_vals_, self.eig_vecs_ = np.linalg.eig(self.covariance_)

        # explained variance (percentage of total variance per component)
        _tot_eig_vals = sum(self.eig_vals_)
        self.explained_variance_ = np.array([(i / _tot_eig_vals) * 100 for i in sorted(self.eig_vals_, reverse = True)])
        self.cum_explained_variance_ = np.cumsum(self.explained_variance_)

        # define `W` as a `d x k`-dimensional projection matrix
        self.W_ = self.eig_vecs_[:, :self.n_components]

        print(X.shape, self.W_.shape)
        return X.dot(self.W_)
Considering the iris dataset as a test case, PCA is performed and visualized as follows:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# loading iris data, and normalize
from sklearn.datasets import load_iris
iris = load_iris()
from sklearn.preprocessing import MinMaxScaler
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)
# using the PCA function (defined above)
# to fit_transform the X value
# naming the PCA object as dPCA (d = defined)
dPCA = PCA()
principalComponents = dPCA.fit_transform(X)
# creating a pandas dataframe for the principal components
# and visualize the data using scatter plot
PCAResult = pd.DataFrame(principalComponents, columns = [f"PCA-{i}" for i in range(1, dPCA.n_components + 1)])
PCAResult["target"] = y # possible as original order does not change
sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult, hue = "target", s = 50)
plt.show()
The output is a scatter plot of PCA-1 vs PCA-2, colored by target.
Now, I wanted to verify the output, for which I used sklearn library, and the output is as follows:
from sklearn.decomposition import PCA # note the same name
sPCA = PCA() # consider all the components
principalComponents_ = sPCA.fit_transform(X)
PCAResult_ = pd.DataFrame(principalComponents_, columns = [f"PCA-{i}" for i in range(1, 5)])
PCAResult_["target"] = y # possible as original order does not change
sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult_, hue = "target", s = 50)
plt.show()
I don't understand why the output is oriented differently, with minor differences in the values. I studied numerous implementations [1, 2, 3], all of which have the same issue. My questions:
What is different in sklearn that makes the plot different? I've tried a different dataset too, with the same problem.
Is there a way to fix this issue?
I was not able to study the sklearn.decomposition.PCA source, as I am new to OOP concepts in Python.
The output in the blog post by Sebastian Raschka also shows a minor variation.
When calculating an eigenvector you may change its sign and the solution will also be a valid one.
So any PCA axis can be reversed and the solution will be valid.
Nevertheless, you may wish to impose a positive correlation of a PCA axis with one of the original variables in the dataset, inverting the axis if needed.
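For example, a minimal sketch of one such convention (not the author's code): flip each eigenvector so that its largest-magnitude loading is positive; the projected scores flip accordingly.

import numpy as np

def enforce_sign_convention(eig_vecs):
    # eig_vecs has one eigenvector per column; make the entry with the
    # largest absolute value in each column positive
    max_abs_rows = np.argmax(np.abs(eig_vecs), axis=0)
    signs = np.sign(eig_vecs[max_abs_rows, np.arange(eig_vecs.shape[1])])
    return eig_vecs * signs

# in the custom PCA above, one could apply it before projecting:
# self.W_ = enforce_sign_convention(self.eig_vecs_)[:, :self.n_components]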
The difference in values comes from sklearn's PCA using an SVD decomposition. Internally, sklearn applies a function called svd_flip to fix the sign of the principal components, which explains the flip you see.
More details on the help page:
It uses the LAPACK implementation of the full SVD or a randomized
truncated SVD by the method of Halko et al. 2009, depending on the
shape of the input data and the number of components to extract.
You can read about the relation here
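In short, the relation is: if the centered data matrix is X_c = U S V^T, then the covariance is X_c^T X_c / (n - 1) = V (S^2 / (n - 1)) V^T, so the columns of V are the principal axes (the eigenvectors of the covariance matrix) and the principal-component scores are X_c V = U S, which is exactly what is computed below.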
We first run your example dataset:
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.utils.extmath import svd_flip
import pandas as pd
import numpy as np
import scipy
iris = load_iris()
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)
n_components = 4
sPCA = PCA(n_components,svd_solver="full")
sklearnPCs = pd.DataFrame(sPCA.fit_transform(X))
We now perform SVD on your centered matrix:
U,S,Vt = scipy.linalg.svd(X - X.mean(axis=0))
U = U[:,:n_components]
U, Vt = svd_flip(U, Vt)
svdPCs = pd.DataFrame(U*S)
The results:
sklearnPCs
0 1 2 3
0 -0.630703 0.107578 -0.018719 -0.007307
1 -0.622905 -0.104260 -0.049142 -0.032359
2 -0.669520 -0.051417 0.019644 -0.007434
3 -0.654153 -0.102885 0.023219 0.020114
4 -0.648788 0.133488 0.015116 0.011786
.. ... ... ... ...
145 0.551462 0.059841 0.086283 -0.110092
146 0.407146 -0.171821 -0.004102 -0.065241
147 0.447143 0.037560 0.049546 -0.032743
148 0.488208 0.149678 0.239209 0.002864
149 0.312066 -0.031130 0.118672 0.052505
svdPCs
0 1 2 3
0 -0.630703 0.107578 -0.018719 -0.007307
1 -0.622905 -0.104260 -0.049142 -0.032359
2 -0.669520 -0.051417 0.019644 -0.007434
3 -0.654153 -0.102885 0.023219 0.020114
4 -0.648788 0.133488 0.015116 0.011786
.. ... ... ... ...
145 0.551462 0.059841 0.086283 -0.110092
146 0.407146 -0.171821 -0.004102 -0.065241
147 0.447143 0.037560 0.049546 -0.032743
148 0.488208 0.149678 0.239209 0.002864
149 0.312066 -0.031130 0.118672 0.052505
You can also implement it without the flip. The absolute values will be the same, and your PCA will still be valid, as noted in the other answer.
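As a quick sanity check (a sketch reusing X, n_components and sklearnPCs from above), the un-flipped SVD scores match sklearn's up to a sign per component:

import numpy as np
import scipy

# same decomposition as before, but without svd_flip
U_raw, S_raw, Vt_raw = scipy.linalg.svd(X - X.mean(axis=0), full_matrices=False)
rawPCs = U_raw[:, :n_components] * S_raw[:n_components]

# identical up to the sign of each column
print(np.allclose(np.abs(rawPCs), np.abs(sklearnPCs.values)))  # expected: True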
I work on a dataset which contains mainly binary variables. However, two of them are categorical with multiple values (strings). I want to apply feature selection using lasso, but I get an error: KeyError: could not convert string to float.
Should I use LabelEncoder and then do the feature selection? Any ideas how to deal with this?
Here is my code
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

X = data.iloc[:, :-1]
y = data.iloc[:, -1]

scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

selector = SelectFromModel(estimator=LassoCV(cv=5)).fit(X_scaled, y)
selector.get_support()
It is problematic to use one-hot encoding here because each category will be coded as a separate binary column, and feeding those into lasso doesn't allow selecting the categorical variable as a whole, which is what you are after, I guess. You can also check out this post.
You can use the group lasso implementation in Python. Below I use an example dataset:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from group_lasso import GroupLasso
from group_lasso.utils import extract_ohe_groups
import scipy.sparse
data = pd.DataFrame({'cat1': np.random.choice(['A','B','C'], 100),
                     'cat2': np.random.choice(['D','E','F'], 100),
                     'bin1': np.random.choice([0,1], 100),
                     'bin2': np.random.choice([0,1], 100)})
data['y'] = 1.5*data['bin1'] + -3*data['bin2'] + 2*(data['cat1'] == 'A').astype('int') + np.random.normal(0,1,100)
Define the categorical and numeric (binary) columns. You don't need the MinMaxScaler since your values are binary. Next we one-hot encode the categorical columns and extract the groups:
cat_columns = ['cat1','cat2']
num_columns = ['bin1','bin2']
ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(data[cat_columns])
groups = extract_ohe_groups(ohe)
Put the numeric and one-hot columns together. You could also convert them to a dense array, but that can be problematic if the data is huge:
X = scipy.sparse.hstack([onehot_data,scipy.sparse.csr_matrix(data[num_columns])])
y = data['y']
Likewise, construct the groups:
groups = np.hstack([groups,len(cat_columns) + np.arange(len(num_columns))+1])
groups
Run the group lasso:
grpLasso = GroupLasso(groups=groups, supress_warning=True, n_iter=1000)
grpLasso.fit(X, y)
grpLasso.sparsity_mask_
array([ True, True, True, False, False, False, True, True])
grpLasso.chosen_groups_
{0, 3, 4}
Check out also the help page for using it in a pipeline.
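A rough sketch of what such a pipeline could look like, based on group_lasso's documented use of GroupLasso as a variable-selection step (parameter values here are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from group_lasso import GroupLasso

# GroupLasso's transform() keeps only the selected columns,
# so it can feed a downstream regressor
pipe = Pipeline([
    ('variable_selection', GroupLasso(groups=groups, supress_warning=True, n_iter=1000)),
    ('regressor', Ridge()),
])
pipe.fit(X, y)
print(pipe.score(X, y))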
Here is my question; I hope someone can help me figure it out.
To explain, there are more than 10 categorical columns in my data set and each of them has 200-300 categories. I want to convert them into binary values. For that, I first used LabelEncoder to convert the string categories into numbers. The LabelEncoder code and its output are shown below.
After LabelEncoder, I used OneHotEncoder from scikit-learn and it worked. BUT THE PROBLEM IS, I need the column names after one-hot encoding. For example, column A has categorical values before encoding: A = [1,2,3,4,..]
It should look like this after encoding:
A-1, A-2, A-3
Does anyone know how to assign column names (old column name - value name or number) after one-hot encoding? Here is my one-hot encoding and its output:
I need named columns because I trained an ANN, but every time new data comes in I cannot convert all the past data again and again. So, I want to add just the new ones every time. Thanks anyway.
You can get the column names using the .get_feature_names() method.
>>> ohenc.get_feature_names()
>>> x_cat_df.columns = ohenc.get_feature_names()
Detailed example is here.
Update
From version 1.0 on, use get_feature_names_out instead.
This example could help for future readers:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
train_X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})
>>>
Sex AgeGroup
0 male 0
1 female 15
2 male 30
3 female 45
4 male 60
5 female 75
encoder=OneHotEncoder(sparse=False)
train_X_encoded = pd.DataFrame (encoder.fit_transform(train_X[['Sex']]))
train_X_encoded.columns = encoder.get_feature_names(['Sex'])
train_X.drop(['Sex'] ,axis=1, inplace=True)
OH_X_train= pd.concat([train_X, train_X_encoded ], axis=1)
>>>
AgeGroup Sex_female Sex_male
0 0 0.0 1.0
1 15 1.0 0.0
2 30 0.0 1.0
3 45 1.0 0.0
4 60 0.0 1.0
5 75 1.0 0.0
Hey, I had the same problem, whereby I had a custom estimator which extended the BaseEstimator class from sklearn.base.
I added an attribute in __init__ called self.feature_names; then, as a last step in the transform method, I just update self.feature_names with the columns of the result.
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class CustomOneHotEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, **kwargs):
        self.feature_names = []

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        result = pd.get_dummies(X)
        self.feature_names = result.columns
        return result
A bit basic I know but it does the job I need it to.
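Hypothetical usage, just to show where the names end up (the example dataframe is made up):

import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                   'AgeGroup': ['0-15', '16-30', '31-45']})

enc = CustomOneHotEncoder()
encoded = enc.fit_transform(df)
print(list(enc.feature_names))
# e.g. ['Sex_female', 'Sex_male', 'AgeGroup_0-15', 'AgeGroup_16-30', 'AgeGroup_31-45']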
If you want to retrieve the column names for the feature importances from your sklearn pipeline you can get the features from the classifier step and the column names from the one hot encoding step.
a = model.best_estimator_.named_steps["clf"].feature_importances_
b = model.best_estimator_.named_steps["ohc"].feature_names
df = pd.DataFrame(a,b)
df.sort_values(by=[0], ascending=False).head(20)
There is another easy way with the package category_encoders. This method also plays nicely with pipelines, which is one of the data science best practices.
import pandas as pd
from category_encoders.one_hot import OneHotEncoder
X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})
ohe = OneHotEncoder(use_cat_names=True)
ohe.fit_transform(X)
Update: as noted in the answer by @Venkatachalam, the method get_feature_names() has been deprecated in scikit-learn 1.0; you will get a warning when trying to run it. Instead, use get_feature_names_out():
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
ohenc = OneHotEncoder(sparse=False)
x_cat_df = pd.DataFrame(ohenc.fit_transform(xtrain_lbl))
x_cat_df.columns = ohenc.get_feature_names_out(input_features=xtrain_lbl.columns)
Setting the parameter sparse=False in OneHotEncoder() will return an array instead of sparse matrix, so you don't need to convert it later. fit_transform() will calculate the parameters and transform the training set in one line.
Source: OneHotEncoder documentation
I want to apply scaling (using StandardScaler() from sklearn.preprocessing) to a pandas dataframe. The following code returns a numpy array, so I lose all the column names and indices. This is not what I want.
features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features = autoscaler.fit_transform(features)
A "solution" I found online is:
features = features.apply(lambda x: autoscaler.fit_transform(x))
It appears to work, but leads to a DeprecationWarning:
/usr/lib/python3.5/site-packages/sklearn/preprocessing/data.py:583:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
I therefore tried:
features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
But this gives:
Traceback (most recent call last):
  File "./analyse.py", line 91, in <module>
    features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 3972, in apply
    return self._apply_standard(f, axis, reduce=reduce)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 4081, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 226, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 363, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5163, in _arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5477, in _homogenize
    raise_cast_failure=False)
  File "/usr/lib/python3.5/site-packages/pandas/core/series.py", line 2885, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
How do I apply scaling to the pandas dataframe, leaving the dataframe intact? Without copying the data if possible.
You could convert the DataFrame to a NumPy array using as_matrix(). Example on a random dataset:
Edit:
Changing as_matrix() to values (it doesn't change the result), per the last sentence of the as_matrix() docs above:
Generally, it is recommended to use ‘.values’.
import pandas as pd
import numpy as np #for the random integer example
df = pd.DataFrame(np.random.randint(0.0, 100.0, size=(10, 4)),
                  index=range(10, 20),
                  columns=['col1', 'col2', 'col3', 'col4'],
                  dtype='float64')
Note, indices are 10-19:
In [14]: df.head(3)
Out[14]:
col1 col2 col3 col4
10 3 38 86 65
11 98 3 66 68
12 88 46 35 68
Now fit_transform the DataFrame to get the scaled_features array:
from sklearn.preprocessing import StandardScaler
scaled_features = StandardScaler().fit_transform(df.values)
In [15]: scaled_features[:3,:] #lost the indices
Out[15]:
array([[-1.89007341, 0.05636005, 1.74514417, 0.46669562],
[ 1.26558518, -1.35264122, 0.82178747, 0.59282958],
[ 0.93341059, 0.37841748, -0.60941542, 0.59282958]])
Assign the scaled data to a DataFrame (note: use the index and columns keyword arguments to keep your original indices and column names):
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
In [17]: scaled_features_df.head(3)
Out[17]:
col1 col2 col3 col4
10 -1.890073 0.056360 1.745144 0.466696
11 1.265585 -1.352641 0.821787 0.592830
12 0.933411 0.378417 -0.609415 0.592830
Edit 2:
Came across the sklearn-pandas package. It's focused on making scikit-learn easier to use with pandas. sklearn-pandas is especially useful when you need to apply more than one type of transformation to column subsets of the DataFrame, a more common scenario. It's documented, but this is how you'd achieve the transformation we just performed.
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([(df.columns, StandardScaler())])
scaled_features = mapper.fit_transform(df.copy(), 4)
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('your file here')
ss = StandardScaler()
df_scaled = pd.DataFrame(ss.fit_transform(df),columns = df.columns)
The df_scaled will be the 'same' dataframe, only now with the scaled values
Reassigning back to df.values preserves both index and columns.
df.values[:] = StandardScaler().fit_transform(df)
features = ["col1", "col2", "col3", "col4"]
autoscaler = StandardScaler()
df[features] = autoscaler.fit_transform(df[features])
This worked for me with MinMaxScaler for getting the array values back into the original dataframe. It should work with StandardScaler as well.
data_scaled = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
where data_scaled is the new data frame, scaled_features is the array after normalization, and df is the original dataframe for which we need the index and columns back.
Works for me:
from sklearn.preprocessing import StandardScaler
cols = list(train_df_x_num.columns)
scaler = StandardScaler()
train_df_x_num[cols] = scaler.fit_transform(train_df_x_num[cols])
This is what I did:
X.Column1 = StandardScaler().fit_transform(X.Column1.values.reshape(-1, 1))
Since scikit-learn version 1.2, estimators can return a DataFrame keeping the column names.
set_output can be configured per estimator by calling the set_output method or globally by setting set_config(transform_output="pandas")
See Release Highlights for scikit-learn 1.2 - Pandas output with set_output API
Example for set_output():
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().set_output(transform="pandas")
Example for set_config():
from sklearn import set_config
set_config(transform_output="pandas")
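A small end-to-end sketch of the per-estimator variant (requires scikit-learn >= 1.2; the example dataframe is made up):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [10.0, 20.0, 30.0]})

scaler = StandardScaler().set_output(transform="pandas")
scaled = scaler.fit_transform(df)

print(type(scaled))          # <class 'pandas.core.frame.DataFrame'>
print(list(scaled.columns))  # ['col1', 'col2']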
You can mix multiple data types in scikit-learn using Neuraxle:
Option 1: discard the row names and column names
from neuraxle.pipeline import Pipeline
from neuraxle.base import NonFittableMixin, BaseStep
class PandasToNumpy(NonFittableMixin, BaseStep):
    def transform(self, data_inputs, expected_outputs):
        return data_inputs.values

pipeline = Pipeline([
    PandasToNumpy(),
    StandardScaler(),
])
Then, you proceed as you intended:
features = df[["col1", "col2", "col3", "col4"]] # ... your df data
pipeline, scaled_features = pipeline.fit_transform(features)
Option 2: to keep the original column names and row names
You could even do this with a wrapper as such:
from neuraxle.pipeline import Pipeline
from neuraxle.base import MetaStepMixin, BaseStep
class PandasValuesChangerOf(MetaStepMixin, BaseStep):
    def transform(self, data_inputs, expected_outputs):
        new_data_inputs = self.wrapped.transform(data_inputs.values)
        new_data_inputs = self._merge(data_inputs, new_data_inputs)
        return new_data_inputs

    def fit_transform(self, data_inputs, expected_outputs):
        self.wrapped, new_data_inputs = self.wrapped.fit_transform(data_inputs.values)
        new_data_inputs = self._merge(data_inputs, new_data_inputs)
        return self, new_data_inputs

    def _merge(self, data_inputs, new_data_inputs):
        new_data_inputs = pd.DataFrame(
            new_data_inputs,
            index=data_inputs.index,
            columns=data_inputs.columns
        )
        return new_data_inputs
df_scaler = PandasValuesChangerOf(StandardScaler())
Then, you proceed as you intended:
features = df[["col1", "col2", "col3", "col4"]] # ... your df data
df_scaler, scaled_features = df_scaler.fit_transform(features)
You can try this code; it will give you a DataFrame with indexes:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston # boston housing dataset
dt= load_boston().data
col= load_boston().feature_names
# Make a dataframe
df = pd.DataFrame(data=dt, columns=col)
# define a method to scale data, looping thru the columns, and passing a scaler
def scale_data(data, columns, scaler):
    for col in columns:
        data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))
    return data
# specify a scaler, and call the method on boston data
scaler = StandardScaler()
df_scaled = scale_data(df, col, scaler)
# view first 10 rows of the scaled dataframe
df_scaled[0:10]
You could directly assign a numpy array to a data frame by using slicing.
from sklearn.preprocessing import StandardScaler
features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features[:] = autoscaler.fit_transform(features.values)