how to get original data from normalized array - python

I have a simple piece of code, given below, which normalizes an array row-wise.
import numpy as np
from sklearn import preprocessing

X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)
X_normalized = preprocessing.normalize(X, norm='l2')
Can you please help me convert X_normalized back to X?

You cannot recover X from the normalized version alone. Consider the trivial case of several two-element data sets, each standardized to zero mean and unit standard deviation (the same argument applies to L2 normalization):
[3, 4]
[-18, 20]
[0, 0.0001]
Each of these standardizes to the same data set:
[-1, 1]
The mapping is many-to-one, not a bijection, so it is not uniquely invertible.
However, you can recover the original set with a couple of simple techniques:
Keep the original data set intact (yes, it's that easy).
Store the normalization parameters: for standardization, the mean and standard deviation (or its square, the variance); for the row-wise L2 normalization in the question, each row's norm. These give you the transformation from each original element to its normalized element, and that transformation is trivial to invert (as sketched below).
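As a minimal sketch of the second approach for the row-wise L2 normalization in the question (keeping the row norms around so the scaling can be undone):
import numpy as np
from sklearn import preprocessing

X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)

# store the per-row L2 norms before normalizing
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = preprocessing.normalize(X, norm='l2')

# invert by scaling each row back by its stored norm
X_recovered = X_normalized * norms
print(np.allclose(X, X_recovered))  # True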

All the scalers in https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing have inverse_transform method designed just for that.
For example, to scale and un-scale your DataFrame with MinMaxScaler you could do:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
unscaled = scaler.inverse_transform(scaled)
Just bear in mind that the transform function (and fit_transform as well) returns a numpy.ndarray, not a pandas.DataFrame.
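If you need the result as a DataFrame again, you can wrap the array yourself; a small sketch (assuming df is the original DataFrame that was scaled):
import pandas as pd

unscaled_df = pd.DataFrame(unscaled, columns=df.columns, index=df.index)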
Reference: https://stackoverflow.com/questions/43382716/how-can-i-cleanly-normalize-data-and-then-unnormalize-it-later/43383700

Related

Sklearn Decision Tree - using sparse matrix and other features simultaneously

I am using the Sklearn Decision Tree for some classification and I have two types of data: categorical and continuous. I used pd.get_dummies for my categorical values and ended up with over 90 features, which is, of course, quite a lot.
The thing is that I then iterate over the max_features parameter to get the best score for my model, and having more than 20 features is too time-consuming. So I thought that Sklearn could use sparse matrices for my categorical features, instead of 70 columns of 0s and 1s.
The question is: can Sklearn use a mix of sparse matrices and regular arrays, or not? If yes, how do I do that? Currently I get the error: setting an array element with a sequence.
Here is some code to get the idea. df_with_dummies is what I currently use, but I hope there is a way to use df_with_sparse:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

a = np.random.randn(10, 3)
b = np.random.random((10, 1))
df = pd.DataFrame(a, columns="A B C".split())
df['temp'] = b
df['dum1'] = np.where(df.temp < 0.5, 1, 0)
df['dum2'] = np.where(df.temp >= 0.5, 1, 0)
del df['temp']
df_with_dummies = df.copy()

a = df[['dum1', 'dum2']]
dums = csr_matrix(a)
df['dums'] = dums
df_with_sparse = df.copy()
When you do:
df['dums'] = dums
dums, being a sparse matrix, is not handled correctly by the pandas DataFrame and is broadcast to each row. pandas does not complain about it because it treats the sparse matrix as a non-array object.
That means each element in the df['dums'] column points to the whole sparse matrix dums. So essentially each array element is being set with an array, hence the error setting an array element with a sequence when it is processed by scikit-learn estimators.
To fix that, keep the dense and sparse parts separate and stack them with scipy:
from scipy.sparse import hstack
df_with_sparse = hstack([df[['A', 'B', 'C']].values, dums])
Now you can pass this sparse matrix on to the estimator.
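For example, a minimal sketch of passing the stacked sparse matrix to a decision tree (the labels y below are purely illustrative):
import numpy as np
from scipy.sparse import hstack
from sklearn.tree import DecisionTreeClassifier

X_sparse = hstack([df[['A', 'B', 'C']].values, dums]).tocsr()
y = np.random.randint(0, 2, size=X_sparse.shape[0])  # illustrative labels

clf = DecisionTreeClassifier()
clf.fit(X_sparse, y)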

removing NaN from a scipy sparse matrix

I have the following piece of code:
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

input_data = pd.read_csv('file_name.tsv', sep='\t')
data = sparse.csr_matrix(input_data.values)
model = TruncatedSVD(n_components=2)
model.fit(data)
Now, TruncatedSVD does accept sparse matrices from scipy, but it does not accept NaNs. I expected the csr_matrix function to strip NaNs, but it does not, and I can't find a way to strip these NaNs from my scipy matrix.
Is there a good way to do this? I can't find a function within scipy.
I ended up setting the NaNs to zero. This is not the optimal solution, but I don't think there really is a satisfying way to impute the missing values in this instance.
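A minimal sketch of that approach, replacing the NaNs in the dense values before building the sparse matrix (file and names as in the question, assuming all columns are numeric):
import numpy as np
import pandas as pd
from scipy import sparse

input_data = pd.read_csv('file_name.tsv', sep='\t')
values = np.nan_to_num(input_data.values)  # NaN -> 0.0
data = sparse.csr_matrix(values)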
There are several approaches to this. One, which you chose, is setting the NaNs to zero; another is to set them to the mean of your data (per column, say).
An easy way to address this is with the scikit-learn imputer:
from sklearn.preprocessing import Imputer

data_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# fit the imputer - suppose the missing data is in the 0th column
data_imputer = data_imputer.fit(data[:, 0])
# transform the data
data[:, 0] = data_imputer.transform(data[:, 0])
Note that this is a very simple example and can be improved a lot; for more information, see the scikit-learn documentation on this issue.
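For what it's worth, Imputer has since been removed from scikit-learn; a minimal sketch of the same idea with the current SimpleImputer, applied to the dense values before building the sparse matrix (input_data is the DataFrame from the question):
import numpy as np
from scipy import sparse
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
values_imputed = imputer.fit_transform(input_data.values)
data = sparse.csr_matrix(values_imputed)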

sklearn mask for onehotencoder does not work

Considering data like:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
dt = 'object, i4, i4'
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)
I want to exclude the text column using the OHE functionality.
Why does the following not work?
ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool))
ohe.fit(d)
ValueError: could not convert string to float: 'bbb'
It says in the documentation:
categorical_features: “all” or array of indices or mask :
Specify what features are treated as categorical.
‘all’ (default): All features are treated as categorical.
array of indices: Array of categorical feature indices.
mask: Array of length n_features and with dtype=bool.
I'm using a mask, yet it still tries to convert to float.
Even using
ohe = OneHotEncoder(categorical_features=np.array([False, True, True], dtype=bool),
                    dtype=dt)
ohe.fit(d)
Same error.
And also in the case of "array of indices":
ohe = OneHotEncoder(categorical_features=np.array([1, 2]), dtype=dt)
ohe.fit(d)
You should understand that all estimators in Scikit-Learn are designed for numerical inputs only. From that point of view, there is no sense in keeping a text column in this form; you have to transform the text column into something numerical, or get rid of it.
If you obtained your dataset from a Pandas DataFrame, you can take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It will help you transform all needed columns simultaneously (or leave some of them in numerical form).
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})
#    number_1  number_2 text
# 0         1         2  aaa
# 1         1         2  bbb
# SomeEncoder here must be any encoder which will help you to get
# numerical representation from text column
mapper = DataFrameMapper([
    ('text', SomeEncoder),
    (['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
I think there's some confusion here. You still need to provide numerical values, but within the encoder you can specify which features are categorical and which are not.
The input to this transformer should be a matrix of integers, denoting
the values taken on by categorical (discrete) features.
So in the example below I change aaa to 5 and bbb to 6, so that they can be distinguished from the 1 and 2 in the numeric columns:
d = np.array([[5, 1, 1], [6, 2, 2]])
ohe = OneHotEncoder(categorical_features=np.array([True,False,False], dtype=bool))
ohe.fit(d)
Now you can check your feature categories:
ohe.active_features_
Out[22]: array([5, 6], dtype=int64)
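As a rough sketch of what this (legacy categorical_features) setup then produces, the one-hot columns for the encoded feature come first, followed by the remaining numeric columns passed through unchanged:
print(ohe.transform(d).toarray())
# two one-hot columns (for the values 5 and 6), then the two numeric columns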
I encountered the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is rather misleading in that regard.

How to pass float argument in predict function of scikit linear regression?

I am using scikit-learn linear regression (single variable) to predict y from x. The argument is of float datatype. How can I transform the float into a numpy array to predict the output?
import matplotlib.pyplot as plt
import pandas
import numpy as np
from sklearn import linear_model
import sys
colnames = ['charge_time', 'running_time']
data = pandas.read_csv('trainingdata.txt', names=colnames)
data = data[data.running_time < 8]
x = np.array(list(data.charge_time))
y = np.array(list(data.running_time))
clf = linear_model.LinearRegression() # Creating a Linear Regression Model
clf.fit(x[:,np.newaxis], y) # Fitting x and y array as training set
data = float(sys.stdin.readline()) # Input is Float e.g. 4.8
print clf.predict(data[:,np.newaxis]) # As per my understanding parameter should be in 1-D array.
First of all, a suggestion not directly related to your question:
You don't need to do x = np.array(list(data.charge_time)); you can call x = np.array(data.charge_time) directly or, even better, x = data.charge_time.values, which returns the underlying ndarray directly.
It is also not clear to me why you're adding a dimension to the input arrays using np.newaxis.
Regarding your question, predict expects an array-like parameter: that can be a list, a numpy array, or similar.
So you should be able to just do data = np.array([float(sys.stdin.readline())]). Putting the float value in a list ([]) is needed because without it numpy would create a 0-d array (i.e. a single value, which cannot be sliced) instead of a 1-d array.
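Putting it together, a minimal sketch of the corrected last two lines (note that recent scikit-learn versions expect a 2-D array of shape (n_samples, n_features) for predict):
value = float(sys.stdin.readline())   # e.g. 4.8
x_new = np.array([[value]])           # shape (1, 1): one sample, one feature
print(clf.predict(x_new))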

How do I compute the variance of a column of a sparse matrix in Scipy?

I have a large scipy.sparse.csc_matrix and would like to normalize it, that is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:
Var[X] = E[X^2] - (E[X])^2
where E[X] stands for the mean. So to calculate E[X^2] you would square the csc_matrix element-wise and then take the mean. To get (E[X])^2 you simply square the result of the mean function applied to the original input.
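A minimal sketch of that formula for column-wise variances of a sparse matrix (the random matrix X here is only for illustration):
import numpy as np
from scipy import sparse

X = sparse.random(100, 5, density=0.3, format='csc')

mean = np.asarray(X.mean(axis=0)).ravel()                  # E[X] per column
mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()   # E[X^2] per column
var = mean_sq - mean ** 2                                  # Var[X] = E[X^2] - (E[X])^2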
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower compared to converting the whole matrix at once):
import numpy as np

# mat is the sparse matrix
cols = mat.shape[1]  # get the number of columns
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=False)
scaler.fit(X)
Then the variances are in the var_ attribute:
X_var = scaler.var_
The curious thing, though, is that when I densified first using pandas (which is very slow) my answer was off by a few percent. I don't know which is more accurate.
The efficient way is actually to densify the entire matrix, then standardize it column-wise in the usual way with:
X = X.toarray()
X -= X.mean(axis=0)
X /= X.std(axis=0)
As @Sebastian has noted in his comments, standardizing destroys the sparsity structure (it introduces lots of non-zero elements) in the subtraction step, so there's no point in keeping the matrix in a sparse format.
