How to transform some columns only with SimpleImputer or equivalent - python

I am taking my first steps with the scikit-learn library and found myself needing to backfill only some columns in my data frame.
I have read the documentation carefully, but I still cannot figure out how to achieve this.
To make this more specific, let's say I have:
A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
And that I would like to fill in the second column with the mean but not the third. How can I do this with SimpleImputer (or another helper class)?
An evolution from this, and the natural follow-up question, is: how can I fill the second column with the mean and the last column with a constant (only for cells that had no values to begin with, obviously)?

There is no need to use SimpleImputer.
DataFrame.fillna() can do the job as well.
For the second column, use
column.fillna(column.mean(), inplace=True)
For the third column, use
column.fillna(constant, inplace=True)
Of course, you will need to replace column with the column of your DataFrame that you want to change, and constant with your desired constant.
Edit
Since the use of inplace is discouraged and will be deprecated, the syntax should be
column = column.fillna(column.mean())
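Applied to the example matrix above, a minimal sketch of this approach (the constant 29 is an arbitrary choice for illustration):
import numpy as np
import pandas as pd
df = pd.DataFrame([[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]])
# Fill the second column with its mean and the third with a constant
df[1] = df[1].fillna(df[1].mean())
df[2] = df[2].fillna(29)
print(df)
#     0    1     2
# 0   7  2.0   3.0
# 1   4  3.5   6.0
# 2  10  5.0  29.0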

Following Dan's advice, an example of using ColumnTransformer and SimpleImputer to backfill the columns is:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
column_trans = ColumnTransformer(
    [('imp_col1', SimpleImputer(strategy='mean'), [1]),
     ('imp_col2', SimpleImputer(strategy='constant', fill_value=29), [2])],
    remainder='passthrough')
# Reorder the transformer's output columns back to the original column order
print(column_trans.fit_transform(A)[:, [2, 0, 1]])
# [[7 2.0 3]
#  [4 3.5 6]
#  [10 5.0 29]]
This approach helps with constructing pipelines, which are better suited to larger applications.
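For instance, a minimal sketch of plugging this transformer into a Pipeline (the LinearRegression estimator and the y values are placeholders for illustration, not part of the original answer):
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
y = [1, 2, 3]  # hypothetical targets
pipe = Pipeline([('impute', column_trans), ('model', LinearRegression())])
pipe.fit(A, y)  # imputes the columns, then fits the model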

This is the method I use; you can replace low_cardinality_cols with the columns you want to encode. It also works if you simply raise the uniqueness threshold to the maximum cardinality in the DataFrame, max(df.nunique()).
# Check the cardinality of the columns to encode
low_cardinality_cols = [cname for cname in df.columns
                        if df[cname].nunique() < 16 and df[cname].dtype == "object"]
Why these columns? It is recommended to encode only columns with cardinality near 10.
# Replace NaN first, otherwise you will get stuck
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')  # feel free to use another strategy
df[low_cardinality_cols] = imp.fit_transform(df[low_cardinality_cols])
# Apply label encoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in low_cardinality_cols:
    df[col] = label_encoder.fit_transform(df[col])

I am assuming you have your data as a pandas DataFrame.
In that case, all you need to do to use SimpleImputer from scikit-learn is to pick the specific column whose NaNs you are looking to impute, say with the 'most_frequent' strategy, convert it to a NumPy array, and reshape it into a column vector.
An example of this is,
## Imputing the missing values using the 'most_frequent' strategy
# We are using the California housing dataset in this example
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
housing = pd.read_csv('housing.csv')
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# SimpleImputer expects a 2D array (a column vector), so convert the pandas Series
housing['total_bedrooms'] = imp.fit_transform(housing['total_bedrooms'].to_numpy().reshape(-1, 1))
Similarly, you can pick any column in your dataset, convert it into a NumPy array, reshape it, and use SimpleImputer.
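As a side note, a minimal sketch that avoids the manual reshape: selecting the column with a list of names yields a one-column DataFrame, which is already two-dimensional:
# Double brackets return a one-column DataFrame, so no reshape is needed
housing[['total_bedrooms']] = imp.fit_transform(housing[['total_bedrooms']])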

Related

How to substitute scaled columns for the original columns in my dataframe

I have scaled some columns; however, how do I put them back into my data frame?
Here is the code that I have:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
num_cols = ['fare_amount','trip_distance','jfk_drop_distance','lga_drop_distance','ewr_drop_distance','met_drop_distance','wtc_drop_distance']
features = train_df[num_cols]
ct = ColumnTransformer(
    [('scaler',
      StandardScaler(),
      ['fare_amount', 'trip_distance', 'jfk_drop_distance', 'lga_drop_distance', 'ewr_drop_distance', 'met_drop_distance', 'wtc_drop_distance'])],
    remainder='passthrough')
ct.fit_transform(features)
The main data frame in which I want to substitute these columns for the old ones is train_data.
I think you are almost there.
Just assign the fit_transform output back to your data frame, like below:
...
train_df[num_cols] = ct.fit_transform(features)
...
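As a design note: since features here contains only the numeric columns, the ColumnTransformer is arguably unnecessary; a minimal equivalent sketch using StandardScaler directly:
from sklearn.preprocessing import StandardScaler
train_df[num_cols] = StandardScaler().fit_transform(train_df[num_cols])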

How can I fill in a column missing 20% in the dataset?

There is a column with 54% of its values missing in the dataset: 17,031 entries are missing in this column. I did not delete it because this column is important to me. I filled it with KNN, but because its neighbors are also NaN, some rows are still NaN after imputation. I tried setting the number of neighbors to 3, 4, and 5, but the result is the same: 12,116 rows remain NaN. Do you suggest dropping the column, or do you have another recommended method?
from sklearn.impute import KNNImputer
df_n = df[["Credit_Score","Annual_Income"]]
var_names = df_n.columns
n_df = np.array(df_n)
imputer = KNNImputer(n_neighbors=3)
new_data = imputer.fit_transform(n_df)
df2=pd.DataFrame(new_data, columns=var_names)
for s in ["Credit_Score","Annual_Income"]:
df[[s]] = df2[s]
You can use sklearn's SimpleImputer (link), which can fill the missing values with the mean, median, or another constant related to the column. This is a simpler imputation strategy than KNN, but it does ensure that no NaNs remain after imputation.
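A minimal sketch of that fallback, reusing df from the question (the median strategy is an arbitrary choice here):
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='median')
df[["Credit_Score", "Annual_Income"]] = imp.fit_transform(df[["Credit_Score", "Annual_Income"]])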

Data imputation in Python for Google Analytics data

I have sets of Google Analytics data from a website which I plan to analyse for a project. However, due to maintenance and other factors, there are chunks of dates for which there is no data. I want to impute this data while still maintaining its integrity, as I plan to plot these sets and compare the curves of different sets to each other over time.
Example: [image of the time series, showing chunks of missing dates]
I want to use the nearest valid datapoints to each missing datapoint to impute that value in order to maintain the underlying shape that can be seen from the image.
I've already tried scikit-learn's KNNImputer and IterativeImputer, but I'm either misunderstanding how these imputers are supposed to be used or they're not the right tools for what I'm trying to do, potentially both.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
df = pd.read_csv('data.csv', names=['Day','Views'],delimiter=',',skiprows=3, usecols=[0,1], skipfooter=1, engine='python', quoting= 1)
df = df.replace(0, np.nan)
da = df.Views.rename_axis('ID').values
da = da.reshape(-1,1)
imputer = IterativeImputer(n_nearest_features = 100, max_iter = 10)
df_imputed = imputer.fit_transform(da)
df_imputed.reshape(1,-1)
df.Views = df_imputed
df
All of the NaN values are calculated to be the exact same number from what I have currently implemented.
Any help would be greatly appreciated.
The problem here was how I reshaped the array. My data was just a 1D array of values, and I was making it 2D by reshaping, which caused all the NaN values to be imputed with the same number. When I added an index column and included it as an input to the imputer, the values were calculated correctly. I also ended up using sklearn's KNNImputer instead of the IterativeImputer in this instance.
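A minimal sketch of that fix, assuming the same df as above (the n_neighbors value is an arbitrary choice):
import numpy as np
from sklearn.impute import KNNImputer
# Pair each value with its positional index so the imputer interpolates
# from the nearest valid days rather than from a single global estimate
da = np.column_stack([np.arange(len(df)), df.Views.to_numpy()])
imputer = KNNImputer(n_neighbors=3)
df.Views = imputer.fit_transform(da)[:, 1]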

What is difference between fit, transform and fit_transform in python when using sklearn?

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean',axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3]=imputer.transform(X[:, 1:3])
Can you help me understand what the above code does? I don't know much about Imputer. Kindly help!
The confusing part is fit and transform.
# Here the fit method will calculate the required parameters (in this case, the mean)
# and store them in the imputer object
imputer = imputer.fit(X[:, 1:3])
# imputer.transform will actually do the work of replacing NaN with the mean
X[:, 1:3] = imputer.transform(X[:, 1:3])
# This can be done in one step using fit_transform
Imputer is used to replace missing values. The fit method calculates the parameters, while the fit_transform method changes the data to replace the NaNs with the mean and outputs a new matrix X.
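For reference, the one-step version (note that Imputer has since been removed from scikit-learn; SimpleImputer in sklearn.impute is the modern replacement):
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])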
# Imports library
from sklearn.preprocessing import Imputer
# Create a new instance of the Imputer object
# Missing values are identified by the 'NaN' marker
# and will be replaced by the column mean later on
# The axis determines whether you impute column-wise (0) or row-wise (1)
imputer = Imputer(missing_values='NaN', strategy='mean',axis=0)
# Fit the imputer to X
imputer = imputer.fit(X[:, 1:3])
# Replace in the original matrix X
# with the new values after the transformation of X
X[:, 1:3]=imputer.transform(X[:, 1:3])
I commented the code for you; I hope this makes a bit more sense. You need to think of X as a matrix that you have to transform in order to have no more NaNs (missing values).
Refer to the documentation for more information.
Your comments tell you the difference: without imputer.fit, you cannot compute the replacement statistic (for example, the mean or median). To apply the replacement, call imputer.transform after imputer.fit, and you will then have a new dataset without NaN values.
As far as I have understood:
Import the specific class from the library:
from sklearn.preprocessing import Imputer
Create an object of the class, configured for our particular data:
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
Fit it (as in applying a function to data) to the matrix X. For example, let an operator e be applied to data d: Imputer.fit prepares e from d:
imputer = imputer.fit(X[:, 1:3])
Now Imputer.transform computes the value of e(d) and assigns it to the given matrix:
X[:, 1:3] = imputer.transform(X[:, 1:3])
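A small end-to-end sketch of the same idea with made-up data, using the modern SimpleImputer (Imputer has been removed from recent scikit-learn versions):
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, np.nan]])
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])          # learns the column means: [5.0, 4.5]
X[:, 1:3] = imputer.transform(X[:, 1:3])  # fills the NaNs with those means
print(X)
# [[1.  2.  3. ]
#  [4.  5.  6. ]
#  [7.  8.  4.5]]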

how to apply preprocessing methods on several columns at one time in sklearn

My question is: I have many columns in my pandas data frame and I am trying to apply sklearn preprocessing to them using DataFrameMapper from the sklearn-pandas library, such as
mapper = DataFrameMapper([
    ('gender', sklearn.preprocessing.LabelBinarizer()),
    ('gradelevel', sklearn.preprocessing.LabelEncoder()),
    ('subject', sklearn.preprocessing.LabelEncoder()),
    ('districtid', sklearn.preprocessing.LabelEncoder()),
    ('sbmRate', sklearn.preprocessing.StandardScaler()),
    ('pRate', sklearn.preprocessing.StandardScaler()),
    ('assn1', sklearn.preprocessing.StandardScaler()),
    ('assn2', sklearn.preprocessing.StandardScaler()),
    ('assn3', sklearn.preprocessing.StandardScaler()),
    ('assn4', sklearn.preprocessing.StandardScaler()),
    ('assn5', sklearn.preprocessing.StandardScaler()),
    ('attd1', sklearn.preprocessing.StandardScaler()),
    ('attd2', sklearn.preprocessing.StandardScaler()),
    ('attd3', sklearn.preprocessing.StandardScaler()),
    ('attd4', sklearn.preprocessing.StandardScaler()),
    ('attd5', sklearn.preprocessing.StandardScaler()),
    ('sbm1', sklearn.preprocessing.StandardScaler()),
    ('sbm2', sklearn.preprocessing.StandardScaler()),
    ('sbm3', sklearn.preprocessing.StandardScaler()),
    ('sbm4', sklearn.preprocessing.StandardScaler()),
    ('sbm5', sklearn.preprocessing.StandardScaler())
])
I am just wondering whether there is a more succinct way to preprocess many variables at once without writing them out explicitly.
Another thing I find a little annoying: when I transform the pandas data frames into arrays that sklearn can work with, they lose the column names, which makes selection very difficult. Does anyone know how to preserve the column names as keys when changing pandas data frames to NumPy arrays?
Thank you so much!
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper
encoders = ['gradelevel', 'subject', 'districtid']
scalars = ['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5']
mapper = DataFrameMapper(
    [('gender', LabelBinarizer())] +
    [(encoder, LabelEncoder()) for encoder in encoders] +
    [(scalar, StandardScaler()) for scalar in scalars]
)
If you're doing this a lot, you could even write your own function:
mapper = data_frame_mapper(binarizers=['gender'],
                           encoders=['gradelevel', 'subject', 'districtid'],
                           scalars=['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5'])
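A minimal sketch of such a helper (data_frame_mapper is a hypothetical name from the snippet above, not part of any library):
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper

def data_frame_mapper(binarizers=(), encoders=(), scalars=()):
    # Build a DataFrameMapper from three lists of column names
    return DataFrameMapper(
        [(col, LabelBinarizer()) for col in binarizers] +
        [(col, LabelEncoder()) for col in encoders] +
        [(col, StandardScaler()) for col in scalars]
    )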
