using Imputer in scikit-learn - python

I need to fill the missing temperature values with the mean value of that month, using Imputer() from scikit-learn.
First I split the dataframe into groups based on the month. Then I called the imputer function to calculate the mean for that group and fill in the missing values.
Here is the code I wrote but it didn't work:
def impute_missing(data_1_group):
    imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
    imp.fit(data_1_group)
    data_1_group = imp.transform(data_1_group['datetime'])
    return data_1_group

for data_1_group in data_1.groupby(pd.TimeGrouper("M")):
    impute_missing(data_1_group)
Any suggestions?

Try this small change:
imp = imp.fit(data_1_group['datetime'])
data_1_group = imp.transform(data_1_group['datetime'])
Though I'm new to scikit-learn myself, I'm recommending the solution that worked for me. This is because:
1) the imp object needs to be overwritten with the fitted imputer, as in the first line;
2) it needs to fit and impute the same dataset, which in this case appears to be data_1_group['datetime'].
I hope this helps.
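For completeness, here is a minimal sketch of the whole per-month imputation, assuming data_1 is a DataFrame with a DatetimeIndex and a hypothetical 'temperature' column. Note that Imputer was removed in later scikit-learn releases (SimpleImputer is its replacement) and pd.TimeGrouper is deprecated in favour of pd.Grouper:
import pandas as pd
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean')  # missing_values defaults to NaN

def impute_missing(group):
    # fit and transform the same group; note the 2D [['temperature']] selection
    group[['temperature']] = imp.fit_transform(group[['temperature']])
    return group

# one group per calendar month; each group's NaNs get that month's mean
data_1 = data_1.groupby(pd.Grouper(freq='M')).apply(impute_missing)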

Related

StandardScaler in Python

I want to standardize 'x_train'.
The first 'x_train' in the picture is the original data set, and the 'x_train' below it is the standardized version.
I just want to standardize the first six columns, so I wrote x_train[:,0:6] during standardization.
However, the result of the standardization is obviously unreasonable. Moreover, when I use the mean and standard deviation of 'x_train' to standardize x_test, the result comes out right. It's weird; I have no idea what's wrong with my code.
Below is my code for standardizing.
Try -
scaler = preprocessing.StandardScaler().fit(x_train.iloc[:, 0:6])
#returning the scaled values to a new variable
X_train_first_six = scaler.transform(x_train.iloc[:, 0:6])
X_test_first_six = scaler.transform(x_test.iloc[:, 0:6])
ref. pandas iloc
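If you then want the scaled values back inside the original DataFrames, iloc assignment accepts the arrays directly (a sketch, continuing the names above):
x_train.iloc[:, 0:6] = X_train_first_six
x_test.iloc[:, 0:6] = X_test_first_six
The key point is that the scaler is fitted on x_train only and then reused for x_test; that is exactly why using x_train's mean and standard deviation on x_test "went right".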

How can I fill in a column missing 20% in the dataset?

There is a column in the dataset with 54% of its values missing; 17031 entries are missing in this column. I did not delete it because this column is important to me. I filled it with KNN, but because its neighbors are also NaN values, some rows are still NaN. I set the number of neighbors to 3, then tried 4 and 5, but the result is the same: 12116 rows remain NaN. Do you suggest I drop the column, or do you have another recommended method?
from sklearn.impute import KNNImputer

df_n = df[["Credit_Score", "Annual_Income"]]
var_names = df_n.columns
n_df = np.array(df_n)
imputer = KNNImputer(n_neighbors=3)
new_data = imputer.fit_transform(n_df)
df2 = pd.DataFrame(new_data, columns=var_names)
for s in ["Credit_Score", "Annual_Income"]:
    df[[s]] = df2[s]
You can use sklearn's SimpleImputer, which can fill the missing values with the mean, median, or another constant related to the column. This is a simpler imputation strategy than KNN, but it does ensure that no NaNs remain after imputation.
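For example, a minimal sketch on the same two columns (strategy="median" is just one option; "mean" and "constant" work the same way):
from sklearn.impute import SimpleImputer

cols = ["Credit_Score", "Annual_Income"]
imputer = SimpleImputer(strategy="median")
# unlike KNNImputer, this cannot leave NaNs behind, since the column
# statistic is computed from all non-missing entries
df[cols] = imputer.fit_transform(df[cols])
One possible compromise is to run KNNImputer first and fill whatever NaNs remain with SimpleImputer afterwards.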

Data imputation in Python for Google Analytics data

I have sets of Google Analytics data from a website which I plan to analyse for a project. However, due to maintenance and other factors, there are chunks of dates for which there is no data. I want to impute this data while still maintaining its integrity, as I plan to plot these sets and compare the curves of different sets to each other over time.
Example
I want to use the nearest valid datapoints to each missing datapoint to impute that value in order to maintain the underlying shape that can be seen from the image.
I've already tried to use scikit-learn's KNNImputer and IterativeImputer, but I'm either misunderstanding how these imputers are supposed to be used or they're not correct for what I'm trying to do, potentially both.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
df = pd.read_csv('data.csv', names=['Day','Views'],delimiter=',',skiprows=3, usecols=[0,1], skipfooter=1, engine='python', quoting= 1)
df = df.replace(0, np.nan)
da = df.Views.rename_axis('ID').values
da = da.reshape(-1,1)
imputer = IterativeImputer(n_nearest_features = 100, max_iter = 10)
df_imputed = imputer.fit_transform(da)
df_imputed.reshape(1,-1)
df.Views = df_imputed
df
All of the NaN values are calculated to be the exact same number from what I have currently implemented.
Any help would be greatly appreciated.
The problem here was how I was reshaping the array. My data was just a 1D array of values, so I was making it 2D by reshaping it, which caused all the NaN values to be imputed as the same number. When I added an index column and included it as an input to the imputer, the values were calculated correctly. I also ended up using the KNN imputer from sklearn instead of the iterative imputer in this instance.
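A sketch of that fix, assuming a running day index can serve as the extra feature (a hypothetical continuation of the code above):
import numpy as np
from sklearn.impute import KNNImputer

# stack a running index next to the values so the imputer has a second,
# never-missing feature to measure nearness with
da = df.Views.values.astype(float)
X = np.column_stack([np.arange(len(da)), da])

imputer = KNNImputer(n_neighbors=2, weights='distance')
df.Views = imputer.fit_transform(X)[:, 1]  # keep only the imputed Views column
With the index column present, each missing value is filled from the nearest valid days rather than from a global statistic, which preserves the underlying shape of the curve.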

Normalizing all numeric columns in my dataset and comparing before and after

I want to normalize all the numeric values in my dataset.
I have taken my whole dataset into a pandas dataframe.
My code to do this so far:
for column in numeric:  # numeric = df._get_numeric_data()
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
But how do I verify this is correct?
I tried plotting a histogram for one of the columns before and after normalizing, by adding this piece of code before and after my for loop:
x=df['Below.Primary'] #Below.Primary is one of my column names
plt.hist(x, bins=45)
The blue histogram was before the for loop and the orange, after.
My full code looked like this:
plt.hist(df['Below.Primary'], bins=45)
for column in numeric:
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
x = df['Below.Primary']
plt.hist(x, bins=45)
I don't see any reduction in scale. What have I done wrong? If this isn't correct, can someone point out the right way to do what I wanted?
Try this:
scaler = preprocessing.StandardScaler()
df[col] = scaler.fit_transform(df[[col]])  # note the 2D [[col]] selection
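If several columns should be scaled, the same scaler accepts the whole block at once, without a loop (a sketch, assuming numeric is a list of column names):
scaler = preprocessing.StandardScaler()
df[numeric] = scaler.fit_transform(df[numeric])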
A couple of general things first.
If numeric is a list of column names (it looks like this is the case), the for loop is not necessary.
A pandas Series uses an ndarray under the hood, so you can just request the ndarray with Series.values instead of calling np.array(). See this page on the pandas Series.
I am assuming you are using preprocessing from sklearn.
I recommend using sklearn.preprocessing.Normalizer for this.
import pandas as pd
from sklearn.preprocessing import Normalizer

### Without the for loop (recommended)
# this version returns an array
normalizer = Normalizer()
normalized_values = normalizer.fit_transform(df[numeric])
# normalized_values is a 2D array, which is useful
# for many applications
# to convert back to a DataFrame:
df = pd.DataFrame(normalized_values, columns=numeric)

### With the for loop (not recommended)
for column in numeric:
    x_array = df[column].values.reshape(-1, 1)
    df[column] = normalizer.fit_transform(x_array)
You have to assign normalized_X back to the respective column while iterating.
for column in numeric:
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
    df[column] = normalized_X[0]  # setting the normalized values in the column
x = df['Below.Primary']
plt.hist(x, bins=45)
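To actually verify the result, note that preprocessing.normalize with its default L2 norm scales each column vector to unit length, so a quick check (a sketch, run after the loop above) is:
import numpy as np

for column in numeric:
    print(column, np.linalg.norm(df[column]))  # each should be ~1.0 after normalizing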

What is the difference between fit, transform and fit_transform in Python when using sklearn?

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean',axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3]=imputer.transform(X[:, 1:3])
Can you help me understand what the above code does? I don't know much about Imputer. Kindly help!
The confusing part is fit and transform.
# here the fit method calculates the required parameters (in this case the mean)
# and stores them in the imputer object
imputer = imputer.fit(X[:, 1:3])
# imputer.transform then does the actual work of replacing NaN with the mean
X[:, 1:3] = imputer.transform(X[:, 1:3])
# this can also be done in one step using fit_transform
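For instance, the two-step fit/transform above collapses into a single call on the same slice:
# learn the column means and fill the NaNs in one call
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])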
Imputer is used to replace missing values. The fit method calculates the parameters, while the transform method changes the data to replace those NaNs with the mean and outputs a new matrix X (fit_transform does both in a single call).
# Import the library
from sklearn.preprocessing import Imputer
# Create a new instance of the Imputer object
# Missing values are identified as NaN
# and will be replaced by the mean later on
# The axis determines whether the mean is computed column-wise (axis=0) or row-wise (axis=1)
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# Fit the imputer to X
imputer = imputer.fit(X[:, 1:3])
# Replace the values in the original matrix X
# with the new values after the transformation of X
X[:, 1:3] = imputer.transform(X[:, 1:3])
I commented the code for you; I hope this makes it a bit clearer. You need to think of X as a matrix that you have to transform in order to have no more NaNs (missing values).
Refer to the documentation for more information.
Your comments already tell you the difference: if you don't call imputer.fit, you can't do the replacement of NaN with some statistic, for example the mean or median. To apply this process, you call imputer.transform after imputer.fit, and then you have a new dataset without NaN values.
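One reason fit and transform are separate methods is that you usually fit on training data only and reuse the learned statistics on new data. A sketch with hypothetical X_train/X_test arrays:
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X_train[:, 1:3])                # means learned from the training data only
X_train[:, 1:3] = imputer.transform(X_train[:, 1:3])
X_test[:, 1:3] = imputer.transform(X_test[:, 1:3])    # the same means are reused, not refitted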
As far as I have understood it:
Import the specific class from the library:
from sklearn.preprocessing import Imputer
Create an object of the class, configured for our particular data:
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
Now apply it (as in applying a function to data) to the matrix X. For example, if an operator e is applied to data d, imputer.fit prepares e for d:
imputer = imputer.fit(X[:, 1:3])
Then imputer.transform computes the value of e(d) and assigns it to the given matrix:
X[:, 1:3] = imputer.transform(X[:, 1:3])
