Missing value imputation in Python using KNN

I have a dataset that looks like this
1908 January 5.0 -1.4
1908 February 7.3 1.9
1908 March 6.2 0.3
1908 April NaN 2.1
1908 May NaN 7.7
1908 June 17.7 8.7
1908 July NaN 11.0
1908 August 17.5 9.7
1908 September 16.3 8.4
1908 October 14.6 8.0
1908 November 9.6 3.4
1908 December 5.8 NaN
1909 January 5.0 0.1
1909 February 5.5 -0.3
1909 March 5.6 -0.3
1909 April 12.2 3.3
1909 May 14.7 4.8
1909 June 15.0 7.5
1909 July 17.3 10.8
1909 August 18.8 10.7
I want to replace the NaNs using KNN as the method. I looked up sklearn's Imputer class, but it supports only mean, median and mode imputation. There is a feature request here, but I don't think it has been implemented yet. Any ideas on how to replace the NaNs in the last two columns using KNN?
Edit:
Since I need to run this code in another environment, I don't have the luxury of installing packages. Sklearn, pandas, NumPy, and other standard packages are the only ones I can use.

The fancyimpute package supports this kind of imputation, using the following API:
from fancyimpute import KNN
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN
# Use 3 nearest rows which have a feature to fill in each row's missing features
X_filled_knn = KNN(k=3).complete(X_incomplete)
Here are the imputations supported by this package:
•SimpleFill: Replaces missing entries with the mean or median of each column.
•KNN: Nearest-neighbor imputation which weights samples using the mean squared difference on features for which two rows both have observed data.
•SoftImpute: Matrix completion by iterative soft thresholding of SVD decompositions. Inspired by the softImpute package for R, which is based on "Spectral Regularization Algorithms for Learning Large Incomplete Matrices" by Mazumder et al.
•IterativeSVD: Matrix completion by iterative low-rank SVD decomposition. Should be similar to SVDimpute from "Missing value estimation methods for DNA microarrays" by Troyanskaya et al.
•MICE: Reimplementation of Multiple Imputation by Chained Equations.
•MatrixFactorization: Direct factorization of the incomplete matrix into low-rank U and V, with an L1 sparsity penalty on the elements of U and an L2 penalty on the elements of V. Solved by gradient descent.
•NuclearNormMinimization: Simple implementation of "Exact Matrix Completion via Convex Optimization" by Emmanuel Candes and Benjamin Recht, using cvxpy. Too slow for large matrices.
•BiScaler: Iterative estimation of row/column means and standard deviations to get a doubly normalized matrix. Not guaranteed to converge but works well in practice. Taken from "Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares".

fancyimpute's KNN imputation no longer supports the complete method suggested in the other answer; you now need to use fit_transform:
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN
# Use 3 nearest rows which have a feature to fill in each row's missing features
X_filled_knn = KNN(k=3).fit_transform(X_incomplete)
Reference: https://github.com/iskandr/fancyimpute

scikit-learn v0.22 supports native KNN Imputation
import numpy as np
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
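Since the question's edit restricts the environment to sklearn, pandas and NumPy, here is a minimal sketch of applying KNNImputer to a frame shaped like the question's data (the column names tmax/tmin and the chosen rows are my own, purely illustrative):
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical column names; a few rows mirroring the question's 1908 data
df = pd.DataFrame({
    'year':  [1908, 1908, 1908, 1908, 1908],
    'month': [3, 4, 5, 6, 7],
    'tmax':  [6.2, np.nan, np.nan, 17.7, np.nan],
    'tmin':  [0.3, 2.1, 7.7, 8.7, 11.0],
})

# Impute only the numeric temperature columns; neighbors are chosen by
# nan_euclidean distance over the values that are actually observed
cols = ['tmax', 'tmin']
df[cols] = KNNImputer(n_neighbors=2).fit_transform(df[cols])
print(df)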

This pull request to sklearn adds KNN support. You can get the code from here.

Related

Fitting model with NaN values in the features

I would like to know if there is a method for fitting a model even when some features contain NaN values.
X
Feature1 Feature2 Feature3 Feature4 Feature5
0 0.1 NaN 0.3 NaN 4.0
1 4.0 6.0 6.6 99.0 2.0
2 11.0 15.0 2.2 3.3 NaN
3 1.0 6.0 2.0 2.5 4.0
4 5.0 11.2 NaN 3.0 NaN
Code
model = LogisticRegression()
model.fit(X_train, y_train)
Error
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Usually, tree-based classifiers can handle NaNs as they just split the dataset based on the feature values. Of course, it also depends on how the algorithm is implemented.
I am not sure about sklearn, but if you really want to classify while preserving the NaN values, your best choice is to use XGBoost. It is not in sklearn, but there are very good libraries for it and they are easy to use as well. It is also one of the most powerful classifiers, so you should definitely try it!
https://xgboost.readthedocs.io/en/latest/python/python_intro.html
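For completeness, here is a minimal sketch (toy data, hypothetical numbers) showing XGBoost's sklearn-style classifier accepting NaN directly:
import numpy as np
from xgboost import XGBClassifier  # requires the xgboost package

# Toy feature matrix with the missing values left as NaN
X = np.array([[0.1, np.nan, 0.3],
              [4.0, 6.0, 6.6],
              [11.0, 15.0, np.nan],
              [1.0, 6.0, 2.0]])
y = np.array([0, 1, 1, 0])

# XGBoost treats NaN as "missing" and learns a default split direction for it
clf = XGBClassifier(n_estimators=10)
clf.fit(X, y)
print(clf.predict(X))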
You can use SimpleImputer() to replace NaN with the mean value, or with a constant, prior to fitting the model. Have a look at the documentation to find the strategy that works for your use case.
In your case, if you want to keep those rows but take the NaN values out of the equation, you can simply replace NaN with 0 using SimpleImputer(strategy='constant', fill_value=0).
As follows:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    SimpleImputer(strategy='constant', fill_value=0),
    LinearRegression()
)
model.fit(X, y)
Note: I am using a pipeline here to run all the steps in one go.

How to calculate weighted mean and median in python?

I have data in a pandas DataFrame or NumPy array and want to calculate the weighted mean (average) or weighted median based on weights in another column or array. I am looking for a simple solution rather than writing functions from scratch or copy-pasting them everywhere I need them.
The data looks like this -
state.head()
State Population Murder.Rate Abbreviation
0 Alabama 4779736 5.7 AL
1 Alaska 710231 5.6 AK
2 Arizona 6392017 4.7 AZ
3 Arkansas 2915918 5.6 AR
4 California 37253956 4.4 CA
And I want to calculate the weighted mean or median of murder rate which takes into account the different populations in the states.
How can I do that?
First, install the weightedstats library in python.
pip install weightedstats
Then, do the following -
Weighted Mean
import weightedstats as ws
ws.weighted_mean(state['Murder.Rate'], weights=state['Population'])
4.445833981123394
Weighted Median
ws.weighted_median(state['Murder.Rate'], weights=state['Population'])
4.4
It also has special weighted mean and median methods for use with NumPy arrays. The methods above will work regardless, but these are available in case you need them.
my_data = [1, 2, 3, 4, 5]
my_weights = [10, 1, 1, 1, 9]
ws.numpy_weighted_mean(my_data, weights=my_weights)
ws.numpy_weighted_median(my_data, weights=my_weights)
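If installing weightedstats is not an option, NumPy covers the weighted mean directly and a short helper handles the weighted median. A rough sketch, rebuilt from the five sample rows in the question (the median definition used here is one common convention and may differ slightly from weightedstats):
import numpy as np
import pandas as pd

# The five sample rows shown in the question
state = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'Population': [4779736, 710231, 6392017, 2915918, 37253956],
    'Murder.Rate': [5.7, 5.6, 4.7, 5.6, 4.4],
})

# Weighted mean comes straight from NumPy
w_mean = np.average(state['Murder.Rate'], weights=state['Population'])

# Weighted median: smallest value whose cumulative weight reaches half the total
def weighted_median(values, weights):
    v = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(v)
    v, w = v[order], w[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

print(w_mean, weighted_median(state['Murder.Rate'], state['Population']))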

Pandas: What determines index order when creating a frame from a dict?

I'm wondering why the index in the data frame below is not sorted when it is created via a nested dict of dicts. I am expecting that the row containing year-2000 data would be the first row, followed by the rows for 2001 and 2002 respectively. I also realize that I can run frame.sort_index() to obtain the desired result, but I'm just wondering why it doesn't happen automatically.
In [1]: import pandas as pd
In [2]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ...:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [3]: frame = pd.DataFrame(pop)
In [4]: frame
Out[4]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
The above was produced with Python 3.8.3 and IPython 7.18.1, and the example comes from chapter 5 of Python for Data Analysis by Wes McKinney (the index is sorted in the book).
I think a good way to understand what is going on is to try it with the order of the states flipped:
pop = {'Ohio': {2000: 1.5, 2001: 1.7,2002: 3.6}, 'Nevada': {2001: 2.4, 2002: 2.9}}
Now you get:
Ohio Nevada
2000 1.5 NaN
2001 1.7 2.4
2002 3.6 2.9
So what happened in the original? Pandas goes through Nevada first, which only has indexes 2001 and 2002. Then it goes through Ohio, which has a new index (2000) that gets appended to the bottom, plus two existing indexes (2001 and 2002) whose values are slotted into the appropriate spots.
As for why it shows up sorted in the book, it is probably a pandas version difference. Modern pandas (post v0.25; see the docs) maintains the key order as specified. The book was probably written for an older version of pandas which happened to process Ohio first.
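A quick way to see both behaviours together, key order preserved for the columns and an explicit sort for the index:
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Columns keep the dict's insertion order; the index follows the order in which
# year labels are first seen, so sort it explicitly for a chronological layout
frame = pd.DataFrame(pop).sort_index()
print(frame)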
Let us try concat after building the input key by key:
out = pd.concat([pd.Series(x) for x in pop.values()],axis=1,keys=pop.keys())
Out[50]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6

Impute values of a vector using Cosine similarity in Python

The Scenario
I have a dataset whose last column has NaN values in it, which need to be imputed using only vector cosine & Pearson correlation; after that, the data will be taken further for clustering.
The Problem
It is mandatory for my case to use VECTOR COSINE and PEARSON CORRELATION.
Here's a chunk of how my dataset is
post_df1, which is read from a CSV using pandas:
uid iid rat
1 303.0 785.0 3.000000
2 291.0 1042.0 4.000000
3 234.0 1184.0 2.000000
4 102.0 768.0 2.000000
254 944.0 170.0 5.000000
255 944.0 171.0 5.000000
256 944.0 172.0 NaN
257 944.0 173.0 NaN
258 944.0 174.0 NaN
This is now taken into a vector (just to make it easy; suggestions welcome) using this command:
vect_1 = post_df1.iloc[:, 2].values
sklearn.preprocessing's Imputer class has mean, median & most-frequent strategies available, but those won't work for my scenario.
Questions
Is there any package other than SurPRISE (by Nicolas Hug) for the vector cosine & Pearson methods?
Is it possible to pass a function/method to sklearn for cosine & Pearson?
Any other method/way out?
Cosine similarity and Pearson correlation are only parameters of an imputation method, not imputation methods in themselves. There are various imputation methods, such as KNN, MICE, SVD and matrix factorization. For example, it is possible to use cosine similarity as the distance measure for a KNN imputer, but I could not find an existing implementation of that. The fancyimpute package may be helpful as a package with a close implementation: https://github.com/hammerlab/fancyimpute/
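If you want to experiment with the KNN-plus-cosine idea yourself, a rough do-it-yourself sketch with NumPy might look like the following (illustrative only; it assumes the matrix has enough fully observed rows to act as donors, and the function name is my own):
import numpy as np

def cosine_knn_impute(X, k=3):
    # Fill NaNs in each row with a cosine-similarity-weighted average of the
    # k most similar fully observed rows; similarity is computed only on the
    # columns the incomplete row actually has. Sketch, not production code.
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    donors = X[~np.isnan(X).any(axis=1)]            # fully observed rows
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        num = donors[:, obs] @ row[obs]
        den = np.linalg.norm(donors[:, obs], axis=1) * np.linalg.norm(row[obs]) + 1e-12
        sim = num / den
        top = np.argsort(sim)[-k:]                  # indices of the k most similar donors
        w = sim[top]
        filled[i, miss] = w @ donors[top][:, miss] / (w.sum() + 1e-12)
    return filled

# e.g. filled = cosine_knn_impute(post_df1[['uid', 'iid', 'rat']].to_numpy())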

Replace NaN or missing values with rolling mean or other interpolation

I have a pandas dataframe with monthly data for which I want to compute a 12-month moving average. Data for every month of January is missing (NaN), however, so I am using
pd.rolling_mean(data["variable"], 12, center=True)
but it just gives me all NaN values.
Is there a simple way that I can ignore the NaN values? I understand that in practice this would become a 11-month moving average.
The dataframe has other variables which have January data, so I don't want to just throw out the January columns and do an 11 month moving average.
There are several ways to approach this, and the best way will depend on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.
import numpy as np
import pandas as pd
df = pd.DataFrame({'month': [10, 11, 12, 1, 2, 3],
                   'temp':  [65, 50, 45, np.nan, 40, 43]}).set_index('month')
You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which results in the January temp being the average of the December and February temps. (I am also using min_periods=1 as suggested in #user394430's answer.)
df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
df['rollmean3'] = df['temp'].rolling( 3,center=True,min_periods=1).mean()
Those are improvements but still have the problem of overwriting existing values with rolling means. To avoid this you could combine with the update() method (see documentation here).
df['update'] = df['rollmean3']
df['update'].update( df['temp'] ) # note: this is an inplace operation
There are even simpler approaches that leave the existing values alone while filling the missing January temps with either the previous month, next month, or the mean of the previous and next month.
df['ffill'] = df['temp'].ffill() # previous month
df['bfill'] = df['temp'].bfill() # next month
df['interp'] = df['temp'].interpolate() # mean of prev/next
In this case, interpolate() defaults to simple linear interpolation, but you have several other interpolation options as well. See the documentation on pandas interpolate for more info, or this Stack Overflow question:
Interpolation on DataFrame in pandas
Here is the sample data with all the results:
temp rollmean12 rollmean3 update ffill bfill interp
month
10 65.0 48.6 57.500000 65.0 65.0 65.0 65.0
11 50.0 48.6 53.333333 50.0 50.0 50.0 50.0
12 45.0 48.6 47.500000 45.0 45.0 45.0 45.0
1 NaN 48.6 42.500000 42.5 45.0 40.0 42.5
2 40.0 48.6 41.500000 40.0 40.0 40.0 40.0
3 43.0 48.6 41.500000 43.0 43.0 43.0 43.0
In particular, note that "update" and "interp" give the same results in all months. While it doesn't matter which one you use here, in other cases one way or the other might be better.
The real key is having min_periods=1. Also, as of version 0.18, the proper call goes through a Rolling object. Therefore, your code should be
data["variable"].rolling(min_periods=1, center=True, window=12).mean()
