The Scenario
I have a dataset whose last column contains NaN values. These need to be imputed using only vector cosine and Pearson correlation, after which the data will be used for clustering.
The Problem
It is mandatory in my case to use VECTOR COSINE and PEARSON CORRELATION.
Here's a chunk of my dataset, post_df1, which is read from a CSV file using pandas:
uid iid rat
1 303.0 785.0 3.000000
2 291.0 1042.0 4.000000
3 234.0 1184.0 2.000000
4 102.0 768.0 2.000000
254 944.0 170.0 5.000000
255 944.0 171.0 5.000000
256 944.0 172.0 NaN
257 944.0 173.0 NaN
258 944.0 174.0 NaN
I then extract the rating column into a vector (just to keep things simple; suggestions welcome) using this command:
vect_1 = post_df1.iloc[:, 2].values
sklearn.preprocessing has a class called Imputer that offers mean, median and most-frequent strategies, but these won't work for my scenario.
Questions
Is there any package other than SurPRISE (by Nicholas Hug) for the vector cosine and Pearson methods?
Is it possible to pass a function / method in sklearn for cosine & pearson?
Any other method / way out?
Cosine similarity and Pearson correlation are only parameters of an imputation method, not imputation methods themselves. There are various imputation methods, such as KNN, MICE, SVD and matrix factorization. For example, cosine similarity could be used as the similarity parameter of a KNN imputation method, but I could not find an existing implementation of that. The fancyimpute package may be helpful as the closest available implementation. The following is the link. GitHub - hammerlab / fancyimpute: Multivariate imputation and matrix completion algorithms implemented in Python https://github.com/hammerlab/fancyimpute/
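To make the idea of "similarity as a parameter of KNN imputation" concrete, here is a minimal NumPy sketch. It assumes the ratings have first been arranged as a user x item matrix (rows are users, columns are items, NaN means "not rated"). The function names similarity and knn_impute are hypothetical helpers written for this illustration, not part of any library, and the weighting scheme (a similarity-weighted mean over the k most similar users with positive similarity) is just one reasonable choice.

```python
import numpy as np

def similarity(a, b, kind="cosine"):
    """Similarity of two rows, using only the columns both rows have observed."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return 0.0  # not enough overlap to say anything
    x, y = a[both], b[both]
    if kind == "pearson":
        # Pearson correlation is the cosine of the mean-centred vectors
        x, y = x - x.mean(), y - y.mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

def knn_impute(R, k=2, kind="cosine"):
    """Fill each NaN with a similarity-weighted mean over the k most similar rows."""
    R = np.asarray(R, dtype=float)
    filled = R.copy()
    n = R.shape[0]
    sims = np.array([[similarity(R[i], R[j], kind) for j in range(n)]
                     for i in range(n)])
    for i, j in zip(*np.where(np.isnan(R))):
        # Candidates: other rows that observed column j and are positively similar
        cand = [u for u in range(n)
                if u != i and not np.isnan(R[u, j]) and sims[i, u] > 0]
        cand = sorted(cand, key=lambda u: sims[i, u], reverse=True)[:k]
        if cand:  # otherwise the entry stays NaN
            w = sims[i, cand]
            filled[i, j] = np.dot(w, R[cand, j]) / w.sum()
    return filled

# Toy user x item matrix (made-up values, not the asker's data)
R = np.array([[5, 4, np.nan],
              [5, 4, 2],
              [1, 2, 5]], dtype=float)
print(knn_impute(R, k=1, kind="cosine"))
print(knn_impute(R, k=2, kind="pearson"))
```

The pairwise loop is quadratic in the number of users, so this is only meant to show the mechanics; for a large ratings table a dedicated library would be the more practical route.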
Related
I would like to know if there is a method for fitting a model even when some features contain NaN values.
X
Feature1 Feature2 Feature3 Feature4 Feature5
0 0.1 NaN 0.3 NaN 4.0
1 4.0 6.0 6.6 99.0 2.0
2 11.0 15.0 2.2 3.3 NaN
3 1.0 6.0 2.0 2.5 4.0
4 5.0 11.2 NaN 3.0 NaN
Code
model = LogisticRegression()
model.fit(X_train, y_train)
Error ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Usually, tree-based classifiers can handle NaNs as they just split the dataset based on the feature values. Of course, it also depends on how the algorithm is implemented.
I am not sure about sklearn, but if you really want to classify them while preserving the NaN values, your best choice is to use XGBoost. It is not part of sklearn, but it has a very good Python library that is easy to use. It is also one of the most powerful classifiers, so you should definitely try it!
https://xgboost.readthedocs.io/en/latest/python/python_intro.html
You can use a SimpleImputer() to replace NaN with the mean value or with a constant prior to fitting the model. Have a look at the documentation to find the strategy that works for your use case.
In your case, if you want to keep the rows with NaN values but take those values out of the equation, you can simply replace NaN with 0 using SimpleImputer(strategy='constant', fill_value=0)
As follows:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(
    SimpleImputer(strategy='constant', fill_value=0),
    LinearRegression()
)
model.fit(X, y)
Note: I am using a pipeline here to run all the steps in one go.
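For the mean strategy mentioned above, a minimal sketch with the LogisticRegression model from the question might look like this. The labels y are made up for illustration; everything else uses the feature matrix shown in the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([
    [0.1, np.nan, 0.3, np.nan, 4.0],
    [4.0, 6.0, 6.6, 99.0, 2.0],
    [11.0, 15.0, 2.2, 3.3, np.nan],
    [1.0, 6.0, 2.0, 2.5, 4.0],
    [5.0, 11.2, np.nan, 3.0, np.nan],
])
y = np.array([0, 1, 1, 0, 1])  # hypothetical labels

# Each NaN is replaced by its column mean before the classifier sees the data
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
model.fit(X, y)
pred = model.predict(X)

# The learned per-column fill values are stored on the imputer step
print(model.named_steps["simpleimputer"].statistics_)
```

Because the imputer is part of the pipeline, the column means are learned from the training data only and reused when predicting on new data.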
A pandas.Series() called "bla" in my example contains pressures in Pa as the index and wind speeds in m/s as values:
bla
100200.0 2.0
97600.0 NaN
91100.0 NaN
85000.0 3.0
82600.0 NaN
...
6670.0 NaN
5000.0 2.0
4490.0 NaN
3880.0 NaN
3000.0 9.0
Length: 29498, dtype: float64
bla.index
Float64Index([100200.0, 97600.0, 91100.0, 85000.0, 82600.0, 81400.0,
79200.0, 73200.0, 70000.0, 68600.0,
...
11300.0, 10000.0, 9970.0, 9100.0, 7000.0, 6670.0,
5000.0, 4490.0, 3880.0, 3000.0],
dtype='float64', length=29498)
As the wind speed values are NaN more often than not, I intended to interpolate considering the different pressure levels in order to have more wind speed values to work with.
The docs of interpolate() state that there's a method called "index" which interpolates considering the index-values, but the results don't make sense as compared to the initial values:
bla.interpolate(method="index", axis=0, limit=1, limit_direction="both")
100200.0 **2.00**
97600.0 10.40
91100.0 8.00
85000.0 **3.00**
82600.0 9.75
...
6670.0 3.00
5000.0 **2.00**
4490.0 9.00
3880.0 5.00
3000.0 **9.00**
Length: 29498, dtype: float64
I marked the original values in boldface.
I'd rather expect something like when using "linear":
bla.interpolate(method="linear", axis=0, limit=1, limit_direction="both")
100200.0 **2.000000**
97600.0 2.333333
91100.0 2.666667
85000.0 **3.000000**
82600.0 4.600000
...
6670.0 4.500000
5000.0 **2.000000**
4490.0 4.333333
3880.0 6.666667
3000.0 **9.000000**
Nevertheless, I'd like to use properly "index" as interpolation method, since this should be the most accurate considering the pressure levels for interpolation to mark the "distance" between each wind speed value.
By and large, I'd like to understand how the interpolation results using "index" with the pressure levels in it could become so counterintuitive, and how I could achieve them to be more sound.
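For reference, the difference between the two methods on a well-behaved series can be sketched with a tiny made-up example: "linear" treats the points as equally spaced, while "index" weights by the numeric index values.

```python
import numpy as np
import pandas as pd

# Index values are unevenly spaced: 0, 1, 10
s = pd.Series([0.0, np.nan, 10.0], index=[0.0, 1.0, 10.0])

by_position = s.interpolate(method="linear")  # ignores the index spacing
by_index = s.interpolate(method="index")      # uses the index as x-coordinates

print(by_position)
print(by_index)
```

Here the missing point sits halfway by position but only a tenth of the way by index value, so the two methods fill it with 5.0 and 1.0 respectively.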
Thanks to @ALollz's first comment underneath my question, I found where the issue lay:
My dataframe had 2 index levels, the outer being unique measuring timestamps and the inner being a standard range index.
I should have looked at each subset associated with a unique timestamp separately.
Within these subsets, interpolation makes sense and the results come out just right.
Example:
# Loop over all unique timestamps in the outermost index level
for timestamp in df.index.get_level_values(level=0).unique():
    # Extract the current subset
    # (this is a new object, so df itself is not modified by the assignment below)
    df_subset = df.loc[timestamp, :]
    # Carry out interpolation on a column of interest
    df_subset["column of interest"] = df_subset[
        "column of interest"].interpolate(method="linear",
                                          axis=0,
                                          limit=1,
                                          limit_direction="both")
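The same per-timestamp interpolation can also be sketched without an explicit loop by grouping on the outer index level. The frame below is made up for illustration: the level name "timestamp" and the column name "wind" are assumptions, and the limit argument is left out to keep the example short.

```python
import numpy as np
import pandas as pd

# Two measurement timestamps, each with its own inner range index
idx = pd.MultiIndex.from_product([["t0", "t1"], [0, 1, 2, 3]],
                                 names=["timestamp", "level"])
df = pd.DataFrame(
    {"wind": [2.0, np.nan, np.nan, 5.0, 10.0, np.nan, 30.0, 40.0]},
    index=idx,
)

# Interpolate within each timestamp only, so values never leak across groups
df["wind_filled"] = (
    df.groupby(level="timestamp")["wind"]
      .transform(lambda s: s.interpolate(method="linear"))
)
print(df)
```

Unlike the loop, this writes the result straight back into a column of the original frame.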
I am trying to aggregate pandas DataFrame and create 2 new columns that would be a slope and an intercept from a simple linear regression fit.
The dummy dataset looks like this:
CustomerID Month Value
a 1 10
a 2 20
a 3 20
b 1 30
b 2 40
c 1 80
c 2 90
And I want the output to look like this - which would regress Value against Month for each CustomerID:
CustomerID Slope Intercept
a 0.30 10
b 0.20 30
c 0.12 80
I know I could run a loop and then for each customerID run the linear regression model, but my dataset is huge and I need a vectorized approach. I tried using groupby and apply by passing linear regression function but didn't find a solution that would work.
Thanks in advance!
Using scipy with groupby. Here I am using a for loop (a dict comprehension) rather than apply, since apply is slower than a for loop.
import pandas as pd
from scipy import stats
# linregress returns (slope, intercept, ...); keep only the first two per customer
pd.DataFrame.from_dict({y:stats.linregress(x['Month'],x['Value'])[:2] for y, x in df.groupby('CustomerID')},orient='index').\
rename(columns={0:'Slope',1:'Intercept'})
Out[798]:
Slope Intercept
a 5.0 6.666667
b 10.0 20.000000
c 10.0 70.000000
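Since the question asks for a vectorized approach, here is a sketch that avoids fitting a model per group. It relies on the closed-form simple linear regression formulas, slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2) and intercept = (Sy - slope*Sx) / n, computed from per-group sums. The helper column names xy and xx are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": ["a", "a", "a", "b", "b", "c", "c"],
    "Month": [1, 2, 3, 1, 2, 1, 2],
    "Value": [10, 20, 20, 30, 40, 80, 90],
})

# Pre-compute the products, then let groupby do all the sums at once
tmp = df.assign(xy=df["Month"] * df["Value"], xx=df["Month"] ** 2)
g = tmp.groupby("CustomerID")
n = g["Month"].count()
sx, sy = g["Month"].sum(), g["Value"].sum()
sxy, sxx = g["xy"].sum(), g["xx"].sum()

slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
intercept = (sy - slope * sx) / n
result = pd.DataFrame({"Slope": slope, "Intercept": intercept})
print(result)
```

Groups with a single row or with a constant Month would make the denominator zero, so those would need separate handling.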
I have a dataset that looks like this
1908 January 5.0 -1.4
1908 February 7.3 1.9
1908 March 6.2 0.3
1908 April NaN 2.1
1908 May NaN 7.7
1908 June 17.7 8.7
1908 July NaN 11.0
1908 August 17.5 9.7
1908 September 16.3 8.4
1908 October 14.6 8.0
1908 November 9.6 3.4
1908 December 5.8 NaN
1909 January 5.0 0.1
1909 February 5.5 -0.3
1909 March 5.6 -0.3
1909 April 12.2 3.3
1909 May 14.7 4.8
1909 June 15.0 7.5
1909 July 17.3 10.8
1909 August 18.8 10.7
I want to replace the NaNs using KNN as the method. I looked up sklearn's Imputer class, but it supports only mean, median and mode imputation. There is a feature request here, but I don't think it has been implemented as of now. Any ideas on how to replace the NaNs in the last two columns using KNN?
Edit:
Since I need to run codes in another environment, I don't have the luxury of installing packages. Sklearn, pandas, numpy, and other standard packages are the only ones I can use.
The fancyimpute package supports this kind of imputation, using the following API:
from fancyimpute import KNN
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN
# Use 3 nearest rows which have a feature to fill in each row's missing features
X_filled_knn = KNN(k=3).complete(X_incomplete)
Here are the imputations supported by this package:
•SimpleFill: Replaces missing entries with the mean or median of each
column.
•KNN: Nearest neighbor imputations which weights samples using the
mean squared difference on features for which two rows both have
observed data.
•SoftImpute: Matrix completion by iterative soft thresholding of SVD
decompositions. Inspired by the softImpute package for R, which is
based on Spectral Regularization Algorithms for Learning Large
Incomplete Matrices by Mazumder et. al.
•IterativeSVD: Matrix completion by iterative low-rank SVD
decomposition. Should be similar to SVDimpute from Missing value
estimation methods for DNA microarrays by Troyanskaya et. al.
•MICE: Reimplementation of Multiple Imputation by Chained Equations.
•MatrixFactorization: Direct factorization of the incomplete matrix
into low-rank U and V, with an L1 sparsity penalty on the elements of
U and an L2 penalty on the elements of V. Solved by gradient descent.
•NuclearNormMinimization: Simple implementation of Exact Matrix
Completion via Convex Optimization by Emmanuel Candes and Benjamin
Recht using cvxpy. Too slow for large matrices.
•BiScaler: Iterative estimation of row/column means and standard
deviations to get doubly normalized matrix. Not guaranteed to converge
but works well in practice. Taken from Matrix Completion and Low-Rank
SVD via Fast Alternating Least Squares.
fancyimpute's KNN imputation no longer supports the complete function suggested in the other answer; we now need to use fit_transform:
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN
# Use 3 nearest rows which have a feature to fill in each row's missing features
X_filled_knn = KNN(k=3).fit_transform(X_incomplete)
reference https://github.com/iskandr/fancyimpute
scikit-learn v0.22 supports native KNN Imputation
import numpy as np
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
This pull request to sklearn adds KNN support. You can get the code from here.
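Applied to a table like the one in the question, a minimal sketch with KNNImputer (scikit-learn 0.22 or later) could look as follows. The column names year, month, tmax and tmin are assumptions, since the original data has no header, and only the first six rows are reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "year": [1908] * 6,
    "month": ["January", "February", "March", "April", "May", "June"],
    "tmax": [5.0, 7.3, 6.2, np.nan, np.nan, 17.7],
    "tmin": [-1.4, 1.9, 0.3, 2.1, 7.7, 8.7],
})

# Impute only the two numeric measurement columns; labels stay untouched
cols = ["tmax", "tmin"]
imputed = df.copy()
imputed[cols] = KNNImputer(n_neighbors=2).fit_transform(df[cols])
print(imputed)
```

With only two columns, the neighbours for a missing tmax are chosen purely by how close the rows are in tmin, so adding further informative numeric columns (for example a month number) may give more sensible neighbours.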
I'm trying to find a way to iterate code for a linear regression over many many columns, upwards of Z3. Here is a snippet of the dataframe called df1
Time A1 A2 A3 B1 B2 B3
1 1.00 6.64 6.82 6.79 6.70 6.95 7.02
2 2.00 6.70 6.86 6.92 NaN NaN NaN
3 3.00 NaN NaN NaN 7.07 7.27 7.40
4 4.00 7.15 7.26 7.26 7.19 NaN NaN
5 5.00 NaN NaN NaN NaN 7.40 7.51
6 5.50 7.44 7.63 7.58 7.54 NaN NaN
7 6.00 7.62 7.86 7.71 NaN NaN NaN
This code returns the slope coefficient of a linear regression for ONE column only and concatenates the value to a numpy array called series. Here is what it looks like for extracting the slope for the first column:
import numpy as np
from sklearn.linear_model import LinearRegression
series = np.array([]) # empty array to collect the slopes
df2 = df1[~np.isnan(df1['A1'])] # drop rows where this column is NaN so sklearn can fit
df3 = df2[['Time','A1']]
# 2-D arrays of shape (n_samples, 1), as LinearRegression expects
X, Y = df3[['Time']].to_numpy(), df3[['A1']].to_numpy()
slope = LinearRegression().fit(X,Y)
m = slope.coef_[0]
series = np.concatenate((series, m), axis = 0)
As it stands now, I am using this slice of code, replacing "A1" with a new column name all the way up to "Z3", and this is extremely inefficient. I know there are many easy ways to do this with some modules, but I have the drawback of all these intermediate NaN values in the time series, so it seems like I'm limited to this method or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1', for example with col in the code, but this does not seem to be working.
Is there any way I can do this more efficiently?
Thank you!
One liner (or three)
time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
['Slope'], df.columns)
Broken down with a bit of explanation
Using the closed form of OLS, $\hat{\beta} = (X^T X)^{-1} X^T Y$
In this case X is time where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd done single brackets, I'd have gotten a series and its one dimension. Then the dot products aren't as pretty.
$(X^T X)^{-1} X^T$ is np.linalg.pinv(time.T.dot(time)).dot(time.T)
Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we could do it all together? You have to deal with the NaNs somehow, and I did so by placing zeroes in the NaN spots. Be aware that this keeps those rows in the fit with a value of 0, so it is not the same as fitting only over the times that have data, and this form has no intercept term.
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
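If the goal is instead to fit each column only over the rows where it has data, and with an intercept, the closed form can still be evaluated for all columns at once by masking the sums. The following is a sketch under those assumptions, using three columns from the question's snippet; the variable names are chosen for this example.

```python
import numpy as np
import pandas as pd

nan = np.nan
df1 = pd.DataFrame({
    "Time": [1.0, 2.0, 3.0, 4.0, 5.0, 5.5, 6.0],
    "A1": [6.64, 6.70, nan, 7.15, nan, 7.44, 7.62],
    "B1": [6.70, nan, 7.07, 7.19, nan, 7.54, nan],
})

t = df1["Time"].to_numpy()[:, None]   # shape (n_rows, 1), broadcasts over columns
Y = df1.drop(columns="Time")
mask = Y.notna().to_numpy()           # True where a value was observed
y = Y.fillna(0).to_numpy()            # zeros contribute nothing to the sums below

n = mask.sum(axis=0)                  # observations per column
sx = (t * mask).sum(axis=0)           # sum of Time over observed rows only
sxx = (t ** 2 * mask).sum(axis=0)
sy = y.sum(axis=0)
sxy = (t * y).sum(axis=0)

slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
intercept = (sy - slope * sx) / n
slopes = pd.DataFrame({"Slope": slope, "Intercept": intercept}, index=Y.columns)
print(slopes)
```

The mask is what restricts the Time sums to the observed rows of each column; the zero fill alone would not do that.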
Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:
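import numpy as np
from sklearn.linear_model import LinearRegression

slopes = []
for c in df1.columns:
    if c == "Time":
        continue  # skip the predictor column itself
    mask = ~np.isnan(df1[c])  # rows where this column has data
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])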
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.