How can I fill in a column missing 20% in the dataset? - python

There is a column in my dataset with 54% of its values missing (17,031 rows). I did not delete it because this column is important to me. I filled it with KNN, but because its neighbors are also NaN, some rows are still NaN after imputation. I set the number of neighbors to 3, then tried 4 and 5, but the result is the same: 12,116 rows remain NaN. Do you suggest dropping the column, or do you have another recommended method?
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df_n = df[["Credit_Score", "Annual_Income"]]
var_names = df_n.columns
n_df = np.array(df_n)
imputer = KNNImputer(n_neighbors=3)
new_data = imputer.fit_transform(n_df)
df2 = pd.DataFrame(new_data, columns=var_names)
# write the imputed columns back into the original dataframe
for s in ["Credit_Score", "Annual_Income"]:
    df[s] = df2[s]

You can use sklearn's SimpleImputer (link), which can fill the missing values with the mean, median, or another constant related to the column. This is a simpler imputation strategy than KNN, but it ensures that no NaNs remain after imputation.
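For example, here is a minimal sketch applied to the two columns from your snippet (the median strategy is an illustrative choice, not a recommendation specific to your data):
import pandas as pd
from sklearn.impute import SimpleImputer

# Impute both columns with their respective medians; unlike KNN,
# this always produces a value, so no NaNs can remain afterwards.
imputer = SimpleImputer(strategy="median")
df[["Credit_Score", "Annual_Income"]] = imputer.fit_transform(
    df[["Credit_Score", "Annual_Income"]]
)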

Related

How can I change NaN values to the min value of given data in Python

I want to replace the NaN values with the minimum value of the data in Python, but I need to match by country.
Here is part of my data.
The NaN values are only in the daily_vaccinations column.
Instead of NaN I want to see the minimum number of vaccinated people for Argentina, and this number should change according to the country: for the Belgium rows, the minimum number of vaccinated Belgians should be used instead.
Use SimpleImputer from sklearn together with NumPy, and specify NaN in the missing_values parameter as shown here:
from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(missing_values = np.nan, strategy= 'constant', fill_value = your_value)
# Then you can use the imputer like this
df[["my_column"]]=imputer.fit_transform(df[["my_column"]])
You can also get the minimum value with this command:
min_value = df['my_column'].min()
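If the fill value needs to depend on the country, as the question describes, a groupby-based fill is a minimal sketch of that idea (it assumes the columns are named country and daily_vaccinations, as in the question):
import pandas as pd

# Fill each country's NaNs with that country's own minimum of daily_vaccinations.
df["daily_vaccinations"] = df["daily_vaccinations"].fillna(
    df.groupby("country")["daily_vaccinations"].transform("min")
)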

Label encode then impute missing then inverse encoding

I have a data set on police killings that you can find on Kaggle. There's some missing data in several columns:
UID 0.000000
Name 0.000000
Age 0.018653
Gender 0.000640
Race 0.317429
Date 0.000000
City 0.000320
State 0.000000
Manner_of_death 0.000000
Armed 0.454487
Mental_illness 0.000000
Flee 0.000000
dtype: float64
I created a copy of the original df to encode it and then impute missing values. My plan was:
Label encode all categorical columns:
Index(['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed',
'Mental_illness', 'Flee'],
dtype='object')
le = LabelEncoder()
lpf = {}
for col in lepf.columns:
    lpf[col] = le.fit_transform(lepf[col])
lpfdf = pd.DataFrame(lpf)
Now I have my dataframe with all categories encoded.
Then, I located those nan values in the original dataframe (pf), to substitute those encoded nan's in lpfdf:
for col in lpfdf:
    print(col, "\n", len(np.where(pf[col].to_frame().isna())[0]))
Gender 8
Race 3965
City 4
State 0
Manner_of_death 0
Armed 5677
Mental_illness 0
Flee 0
For instance, Gender got three encoded labels: 0 for Male, 1 for Female, and 2 for NaN. However, the feature City had more than 3000 values, and it was not possible to locate the NaN label using value_counts(). For that reason, I used:
np.where(pf["City"].to_frame().isna())
Which yielded:
(array([ 4110,  9093, 10355, 10549], dtype=int64), array([0, 0, 0, 0], dtype=int64))
Looking at any of the rows corresponding to these indices, I saw that the NaN label for City was 3327:
lpfdf.iloc[10549]
Gender 1
Race 6
City 3327
State 10
Manner_of_death 1
Armed 20
Mental_illness 0
Flee 0
Name: 10549, dtype: int64
Then I proceeded to substitute these labels with np.nan:
"""
Gender: 2,
Race: 6,
City: 3327,
Armed: 59
"""
lpfdf["Gender"] = lpfdf["Gender"].replace(2, np.nan)
lpfdf["Race"] = lpfdf["Race"].replace(6, np.nan)
lpfdf["City"] = lpfdf["City"].replace(3327, np.nan)
lpfdf["Armed"] = lpfdf["Armed"].replace(59, np.nan)
Create an instance of IterativeImputer and then fit and transform lpfdf:
itimp = IterativeImputer()
iilpf = itimp.fit_transform(lpfdf)
Then make a dataframe for these new imputed values:
itimplpf = pd.DataFrame(np.round(iilpf), columns = lepf.columns)
And finally, when I go to inverse transform to see the corresponding labels it imputed, I get the following error:
for col in lpfdf:
    le.inverse_transform(itimplpf[col].astype(int))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-191-fbdde4bb4781> in <module>
1 for col in lpfdf:
----> 2 le.inverse_transform(itimplpf[col].astype(int))
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in inverse_transform(self, y)
158 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
159 if len(diff):
--> 160 raise ValueError(
161 "y contains previously unseen labels: %s" % str(diff))
162 y = np.asarray(y)
ValueError: y contains previously unseen labels: [2 3 4 5]
What is wrong with my steps?
Sorry for the long-winded explanation, but I felt that I needed to explain all the steps so that you can understand the issue properly. Thank you all.
A possibility that might be worth exploring is predicting missing categorical (encoded) values using a machine learning algorithm e.g. sklearn.ensemble.RandomForestClassifier.
Here, you would train a multiclass classification model for predicting missing values of each of your columns. You'd start by replacing missing values with a magic value (e.g. -99) and then one-hot encode them. Next, train a classification model to predict the categorical value of a chosen column, using the one-hot encoded values of the other columns as training data. The training data would, of course, exclude rows where the column to be predicted is missing. Finally, compose a "test" set from the rows where this column is missing, predict the values, and impute these values into the column. Repeat this for each column that needs to have missing values imputed.
Assuming you want to apply machine learning techniques to this data at a later point, a deeper question is whether the absence of values in some examples of the dataset may in fact carry useful information for predicting your Target, and consequently, whether a particular imputation strategy could corrupt that information.
Edit: Below is an example of what I mean, using dummy data.
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
#from catboost import CatBoostClassifier

# create some fake data
n_samples = 1000
n_features = 20
features_og, _ = make_classification(n_samples=n_samples, n_features=n_features,
                                      n_informative=3, n_repeated=16, n_redundant=0)

# convert to fake categorical data
features_og = (features_og * 10).astype(int)

# add missing value flag (-99) at random
features = features_og.copy()
for i in range(n_samples):
    for j in range(n_features):
        if np.random.random() > 0.85:
            features[i, j] = -99

# go column by column predicting and replacing missing values
features_fixed = features.copy()
for j in range(n_features):
    # do train/test split based on whether the selected column value is -99
    train = features[np.where(features[:, j] != -99)]
    test = features[np.where(features[:, j] == -99)]

    clf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=42)
    # potentially better for categorical features is CatBoost:
    #clf = CatBoostClassifier(n_estimators=300, cat_features=[identify categorical features here])

    # train the classifier to predict the value of column j using the other columns
    clf.fit(train[:, [x for x in range(n_features) if x != j]], train[:, j])

    # predict values for elements of column j that have the missing flag
    preds = clf.predict(test[:, [x for x in range(n_features) if x != j]])

    # substitute the missing values in column j with the predicted values
    features_fixed[np.where(features[:, j] == -99), j] = preds
Your approach of encoding categorical values first and then imputing missing values is prone to problems and thus not recommended.
Some imputing strategies, like IterativeImputer, do not guarantee that the output contains only previously known numeric values. This can result in imputed values that are unknown to the encoder and will cause an error upon the inverse transformation (which is exactly your case).
It is better to first impute the missing values for both numeric and categorical features, and then encode the categorical features. One option would be to use SimpleImputer and replace missing values with the most frequent category or a new constant value.
Also, a note on LabelEncoder: it is clearly mentioned in its documentation that:
This transformer should be used to encode target values, i.e. y, and not the input X.
If you insist on an encoding strategy like LabelEncoder, you can use OrdinalEncoder, which does the same but is actually meant for feature encoding. However, you should be aware that such an encoding strategy might falsely suggest an ordinal relationship between the categories of a feature, which might lead to undesired consequences. You should therefore consider other encoding strategies as well.
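A minimal sketch of this impute-then-encode order, assuming pf is the original dataframe from the question and using the categorical columns it lists:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["Gender", "Race", "City", "State", "Manner_of_death",
            "Armed", "Mental_illness", "Flee"]

# 1) Impute first: replace missing categories with the most frequent one.
imp = SimpleImputer(strategy="most_frequent")
pf[cat_cols] = imp.fit_transform(pf[cat_cols])

# 2) Encode afterwards: every value is now a known category,
#    so inverse_transform can never encounter an unseen label.
enc = OrdinalEncoder()
pf[cat_cols] = enc.fit_transform(pf[cat_cols])

# ... later, to recover the original category labels:
pf[cat_cols] = enc.inverse_transform(pf[cat_cols])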
The entire process can be automated with the datawig package. You just need to create an imputation model for each to-be-imputed column, and it will handle the encoding and inverse encoding by itself.
It was even tested against kNN and iterative imputer and showed better results.
Here is a personal guide.

Data imputation in Python for Google Analytics data

I have sets of Google Analytics data from a website which I plan to analyse for a project. However, due to maintenance and other factors, there are chunks of dates for which there is no data. I want to impute this data while still maintaining the integrity of the data as I plan to plot these sets and compare the curves of different sets to each-other over time.
Example
I want to use the nearest valid datapoints to each missing datapoint to impute that value in order to maintain the underlying shape that can be seen from the image.
I've already tried to use scikit-learn's KNNImputer and IterativeImputer, but I'm either misunderstanding how these imputers are supposed to be used or they're not correct for what I'm trying to do, potentially both.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
df = pd.read_csv('data.csv', names=['Day','Views'],delimiter=',',skiprows=3, usecols=[0,1], skipfooter=1, engine='python', quoting= 1)
df = df.replace(0, np.nan)
da = df.Views.rename_axis('ID').values
da = da.reshape(-1,1)
imputer = IterativeImputer(n_nearest_features = 100, max_iter = 10)
df_imputed = imputer.fit_transform(da)
df_imputed.reshape(1,-1)
df.Views = df_imputed
df
All of the NaN values are calculated to be the exact same number from what I have currently implemented.
Any help would be greatly appreciated.
The problem here was how I was reshaping the array. My data was just a 1D array of values, so I was making it 2D by reshaping the array, which was causing all the NaN values to be imputed as the same number. When I added an index column and included this as an input to the imputer, the values were calculated correctly. I also ended up using a KNN imputer from sklearn instead of the iterative imputer in this instance.
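A minimal sketch of that approach (assuming df has a Views column with NaNs for the missing days, as in the question; the neighbour count and distance weighting are illustrative choices):
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Pair each value with its positional index so the imputer can use
# nearby days as neighbours instead of seeing a single featureless column.
X = np.column_stack([np.arange(len(df)), df["Views"].to_numpy()])

imputer = KNNImputer(n_neighbors=3, weights="distance")
X_imputed = imputer.fit_transform(X)

df["Views"] = X_imputed[:, 1]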

Normalizing all numeric columns in my dataset and comparing before and after

I want to normalize all the numeric values in my dataset.
I have taken my whole dataset into a pandas dataframe.
My code to do this so far:
for column in numeric:  # numeric = df._get_numeric_data()
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
But how do I verify this is correct?
I tried plotting a histogram for one of the columns before and after normalizing, by adding this piece of code before and after my for loop:
x=df['Below.Primary'] #Below.Primary is one of my column names
plt.hist(x, bins=45)
The blue histogram was before the for loop and the orange, after.
My total code looked like this:
ln[21] plt.hist(df['Below.Primary'], bins=45)
ln[22] for column in numeric:
           x_array = np.array(df[column])
           normalized_X = preprocessing.normalize([x_array])
       x = df['Below.Primary']
       plt.hist(x, bins=45)
I don't see any reduction in scale. What have I done wrong? If this is not correct, can someone point out the correct way to do what I wanted to do?
Try using this:
scaler = preprocessing.StandardScaler()
df[[col]] = scaler.fit_transform(df[[col]])  # double brackets keep the input 2D, as fit_transform expects
A couple of general things first.
If numeric is a list of column names (looks like this is the case), the for loop is not necessary.
A Pandas Series uses an ndarray under the hood, so you can just request the ndarray with Series.values instead of calling np.array(). See this page on the Pandas Series.
I am assuming you are using preprocessing from sklearn.
I recommend using sklearn.preprocessing.Normalizer for this.
import pandas as pd
from sklearn.preprocessing import Normalizer

### Without the for loop (recommended)
# this version returns an array
normalizer = Normalizer()
normalized_values = normalizer.fit_transform(df[numeric])
# normalized_values is a 2D array, which is useful
# for many applications

# to convert back to a DataFrame
df = pd.DataFrame(normalized_values, columns=numeric)

### With the for loop (not recommended)
for column in numeric:
    x_array = df[column].values.reshape(-1, 1)
    df[column] = normalizer.fit_transform(x_array)
You have to assign normalized_X back to the respective column while iterating.
for column in numeric:
    x_array = np.array(df[column])
    normalized_X = preprocessing.normalize([x_array])
    df[column] = normalized_X[0]  # set the normalized values back into the column (normalize returns a 2D array)
x = df['Below.Primary']
plt.hist(x, bins=45)

How to transform some columns only with SimpleImputer or equivalent

I am taking my first steps with the scikit-learn library and found myself needing to backfill only some columns in my data frame.
I have read carefully the documentation but I still cannot figure out how to achieve this.
To make this more specific, let's say I have:
A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
And that I would like to fill in the second column with the mean but not the third. How can I do this with SimpleImputer (or another helper class)?
An evolution from this, and the natural follow-up question, is: how can I fill the second column with the mean and the last column with a constant (only for cells that had no values to begin with, obviously)?
There is no need to use SimpleImputer; DataFrame.fillna() can do the job as well.
For the second column, use
column.fillna(column.mean(), inplace=True)
For the third column, use
column.fillna(constant, inplace=True)
Of course, you will need to replace column with your DataFrame's column you want to change and constant with your desired constant.
Edit
Since the use of inplace is discouraged and will be deprecated, the syntax should be
column = column.fillna(column.mean())
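Applied to the A matrix from the question, a minimal sketch (treating A as a DataFrame with default integer column labels; the constant 99 is just an illustrative fill value):
import numpy as np
import pandas as pd

A = [[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]]
df = pd.DataFrame(A)

# Fill the second column with its mean and the third with a constant.
df[1] = df[1].fillna(df[1].mean())
df[2] = df[2].fillna(99)
print(df)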
Following Dan's advice, an example of using ColumnTransformer and SimpleImputer to backfill the columns is:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

A = [[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]]

column_trans = ColumnTransformer(
    [('imp_col1', SimpleImputer(strategy='mean'), [1]),
     ('imp_col2', SimpleImputer(strategy='constant', fill_value=29), [2])],
    remainder='passthrough')

print(column_trans.fit_transform(A)[:, [2, 0, 1]])
# [[7 2.0 3]
#  [4 3.5 6]
#  [10 5.0 29]]
This approach helps with constructing pipelines which are more suitable for larger applications.
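For instance, a minimal sketch of composing the same ColumnTransformer with a downstream estimator in a Pipeline (the RandomForestRegressor and the target y are illustrative assumptions, not part of the question):
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

A = [[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]]
y = [0.5, 1.0, 1.5]  # illustrative target values

column_trans = ColumnTransformer(
    [('imp_col1', SimpleImputer(strategy='mean'), [1]),
     ('imp_col2', SimpleImputer(strategy='constant', fill_value=29), [2])],
    remainder='passthrough')

# Imputation and model fitting happen in one object, which keeps
# preprocessing consistent between training and prediction.
model = Pipeline([
    ('impute', column_trans),
    ('regressor', RandomForestRegressor(random_state=0)),
])
model.fit(A, y)
print(model.predict(A))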
This is the method I use; you can replace low_cardinality_cols with the columns you want to encode. It also works if you simply set the unique-value threshold to max(df.columns.nunique()).
# check the cardinality of the columns to encode
low_cardinality_cols = [cname for cname in df.columns
                        if df[cname].nunique() < 16 and df[cname].dtype == "object"]
Why these columns? It is recommended to encode only columns with a cardinality of around 10.
# Replace NaN first, otherwise you'll get stuck
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')  # feel free to use other strategies
df[low_cardinality_cols] = imp.fit_transform(df[low_cardinality_cols])
# Apply label encoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in low_cardinality_cols:
    df[col] = label_encoder.fit_transform(df[col])
I am assuming you have your data as a pandas dataframe.
In this case, all you need to do to use the SimpleImputer from scikit-learn is pick the specific column you're looking to impute NaNs in (say, using the 'most_frequent' value), convert it to a NumPy array, and reshape it into a column vector.
An example of this is:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

## Imputing the missing values: we fill the missing values using the 'most_frequent' strategy.
# We are using the California housing dataset in this example.
housing = pd.read_csv('housing.csv')

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# SimpleImputer expects a column vector (2D input), so convert the pandas Series and reshape it.
housing['total_bedrooms'] = imp.fit_transform(housing['total_bedrooms'].to_numpy().reshape(-1, 1))
Similarly, you can pick any column in your dataset, convert it into a NumPy array, reshape it, and use the SimpleImputer.
