I have the following piece of code:
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

input_data = pd.read_csv('file_name.tsv', sep='\t')
data = sparse.csr_matrix(input_data.values)
model = TruncatedSVD(n_components=2)
model.fit(data)
Now, TruncatedSVD does accept scipy sparse matrices, but it does not accept NaN. I expected csr_matrix to strip NaNs, but it does not, and I can't find a way to strip them from my scipy matrix.
Is there a good way to do this? I can't find a function within scipy.
I ended up setting the NaNs to zero. This is not the optimal solution, but I don't think there is really a satisfying way to impute the missing values in this instance.
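For reference, a minimal sketch of that approach, replacing the NaNs with zero before building the sparse matrix (continuing from the code above):
import numpy as np

dense = np.nan_to_num(input_data.values)   # replaces NaN with 0 (and inf with large finite values)
data = sparse.csr_matrix(dense)            # now safe to pass to TruncatedSVD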
There are several approaches to this. One is the one you chose, setting the NaNs to zero; another is to set them equal to the mean of your data (of each column, say).
An easy way to address this is with scikit-learn's imputer:
from sklearn.preprocessing import Imputer

data_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# fit the imputer - suppose the missing data is in the 0th column
data_imputer = data_imputer.fit(data[:, [0]])
# transform the data
data[:, [0]] = data_imputer.transform(data[:, [0]])
Note that this is a very simple example and can be improved a lot; for more information, see the scikit-learn documentation on this issue.
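In recent scikit-learn versions Imputer has been removed; the equivalent is SimpleImputer from sklearn.impute. A minimal sketch of the same idea, assuming data is a dense NumPy array with the missing values in the 0th column:
import numpy as np
from sklearn.impute import SimpleImputer

data_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
data[:, [0]] = data_imputer.fit_transform(data[:, [0]])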
Related
I'm new to sklearn. Could someone explain to me why, in the fit method of the linear regression, the predictor (X) is coded like this:
X = df[['highway-mpg']]
and the response variable is coded in this form:
Y = df['price']
I'm a little confused about when I have to use df with double brackets versus single brackets. I tried to understand it from the sklearn documentation for the fit method, but I only got more confused.
Double brackets:
They are used to select multiple columns from a DataFrame and the result is a DataFrame which is a 2D array.
Single bracket:
They are used to select one column from a DataFrame, and the result is a Series, which is a 1D array.
According to the scikit-learn docs, in the fit method of LinearRegression the shape of X should be (n_samples, n_features), which is why we use double brackets.
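A quick illustration of the difference, using a small made-up df just to show the shapes:
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'highway-mpg': [30, 25, 40], 'price': [13000, 16000, 9000]})
X = df[['highway-mpg']]   # DataFrame, shape (3, 1): 2D, what fit() expects for X
Y = df['price']           # Series, shape (3,): 1D, fine for the target y
model = LinearRegression().fit(X, Y)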
I am trying to impute some missing values in a DataFrame using the scikit-learn IterativeImputer(). The problem is that the imputer takes a pandas dataframe as input but returns a numpy array instead of the original dataframe. Here is a simple example taken from this post.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # needed to enable IterativeImputer
from sklearn.impute import IterativeImputer

# Create an empty dataset
df = pd.DataFrame()
# Create two variables called x0 and x1. Make the first value of x1 a missing value
df['x0'] = [0.3051,0.4949,0.6974,0.3769,0.2231,0.341,0.4436,0.5897,0.6308,0.5]
df['x1'] = [np.nan,0.2654,0.2615,0.5846,0.4615,0.8308,0.4962,0.3269,0.5346,0.6731]
imputer = IterativeImputer(max_iter=10, random_state=42)
imputer.fit(df)
imputed_df = imputer.transform(df)
imputed_df
The problem is that when the numpy array is returned, the column names and other metadata are lost. I can of course manually extract that metadata from the original dataframe and then reapply it, but that seems a bit hacky. Pandas has its own imputation in the form of DataFrame.fillna(), but its algorithms are not as sophisticated as the scikit-learn ones.
So is there a way to fit the imputer to a dataframe and get a dataframe back as the result?
Yes you can; just assign the values back:
df[:] = imputer.transform(df)
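If you would rather keep df untouched, an alternative sketch is to rebuild a new DataFrame from the returned array, reusing the original columns and index:
imputed_df = pd.DataFrame(imputer.transform(df), columns=df.columns, index=df.index)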
I am using a Sklearn decision tree for some classification, and I have two types of data: categorical and continuous. I used pd.get_dummies for my categorical values and ended up with over 90 features, which is, of course, quite a lot.
The thing is that I then iterate over the max_features parameter to get the best score for my model, and having more than 20 features is too time-consuming. So I thought that Sklearn could use sparse matrices for my categorical features instead of 70 columns with 0 and 1.
The question is: can Sklearn use a mix of sparse matrices and regular arrays or not? If yes, how do I do that? Currently I get the error: setting an array element with a sequence.
Here is some code to get the idea. df_with_dummies is what I currently use, but I hope there is a way to use df_with_sparse.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
a = np.random.randn(10,3)
b = np.random.random((10,1))
df = pd.DataFrame(a, columns = "A B C".split())
df['temp'] = b
df['dum1'] = np.where(df.temp < 0.5, 1, 0)
df['dum2'] = np.where(df.temp >= 0.5, 1, 0)
del df['temp']
df_with_dummies = df.copy()
a = df[['dum1', 'dum2']]
dums = csr_matrix(a)
df['dums'] = dums
df_with_sparse = df.copy()
When you do:
df['dums'] = dums
dums, being a sparse matrix, is not handled correctly by the pandas DataFrame and is broadcast to each row. pandas does not complain about it because it treats the sparse matrix as a non-array object.
That means that each element in the df['dums'] column points to the whole sparse matrix dums. So essentially, each array element is being set with an array, hence the error "setting an array element with a sequence" when it is processed by scikit-learn estimators.
To combine the dense columns with the sparse dummy matrix properly, you can do:
from scipy.sparse import hstack
df_with_sparse = hstack([df[['A', 'B', 'C']].values, dums])
Now you can pass this on to your estimator.
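For example, a minimal sketch of feeding the stacked sparse matrix to a decision tree; the labels y here are made up purely for illustration:
from sklearn.tree import DecisionTreeClassifier

y = np.random.randint(0, 2, size=10)   # hypothetical labels for the 10 rows
clf = DecisionTreeClassifier()
clf.fit(df_with_sparse, y)             # scikit-learn estimators accept scipy sparse input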
I have a simple piece of code, given below, which normalizes an array row-wise.
import numpy as np
from sklearn import preprocessing
X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)
X_normalized = preprocessing.normalize(X, norm='l2')
Can you please help me convert X_normalized back to X again?
You cannot recover X from nothing more than the normalized version. Consider the trivial case of several rows, each with 2 elements:
[3, 4]
[6, 8]
[0.3, 0.4]
Each of these L2-normalizes to the same row:
[0.6, 0.8]
The mapping is not a bijection: it is many-to-one. Thus, it is not uniquely invertible.
However, you can recover the original data with a couple of simple techniques:
Keep the original data set intact (yes, it's that easy).
Store the normalization parameters: for row-wise L2 normalization, the norm of each row (for standardization, the mean and standard deviation). These give you the linear equation that transforms each original element into a normalized element; it is trivial to invert that equation.
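A minimal sketch of the second technique for the row-wise L2 normalization above, storing the row norms so the operation can be undone:
norms = np.linalg.norm(X, axis=1, keepdims=True)   # one L2 norm per row
X_normalized = X / norms                           # same as preprocessing.normalize(X, norm='l2')
X_recovered = X_normalized * norms                 # back to the original X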
All the scalers in https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing have an inverse_transform method designed just for that.
For example, to scale and un-scale your DataFrame with MinMaxScaler you could do:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
unscaled = scaler.inverse_transform(scaled)
Just bear in mind that the transform function (and fit_transform as well) returns a numpy array, not a pandas DataFrame.
[Reference][1]
[1]: https://stackoverflow.com/questions/43382716/how-can-i-cleanly-normalize-data-and-then-unnormalize-it-later/43383700
I have a large scipy.sparse.csc_matrix and would like to normalize it, that is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:
E[X^2] - (E[X])^2
E[X] stands for the mean. So to calculate E[X^2] you would square the csc_matrix element-wise (e.g. with .power(2)) and then take the mean. To get (E[X])^2 you simply square the result of the mean function applied to the original matrix.
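As a minimal sketch, assuming mat is your csc_matrix, the per-column variance and std would be:
import numpy as np

col_mean = np.asarray(mat.mean(axis=0)).ravel()              # E[X] per column
col_mean_sq = np.asarray(mat.power(2).mean(axis=0)).ravel()  # E[X^2] per column
col_var = col_mean_sq - col_mean ** 2
col_std = np.sqrt(col_var)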
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower compared to converting the whole matrix at once):
import numpy as np

# mat is the sparse matrix
# Get the number of columns
cols = mat.shape[1]
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X)
Then the variances are in the var_ attribute:
X_var = scaler.var_
The curious thing, though, is that when I densified first using pandas (which is very slow), my answer was off by a few percent. I don't know which is more accurate.
The efficient way is actually to densify the entire matrix, then standardize it column-wise in the usual way with:
X = X.toarray()
X -= X.mean(axis=0)
X /= X.std(axis=0)
As #Sebastian has noted in his comments, standardizing destroys the sparsity structure (introduces lots of non-zero elements) in the subtraction step, so there's no use keeping the matrix in a sparse format.