sklearn mask for onehotencoder does not work - python

Considering data like:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
dt = 'object, i4, i4'
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)
I want to exclude the text column using the OHE functionality.
Why does the following not work?
ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool))
ohe.fit(d)
ValueError: could not convert string to float: 'bbb'
It says in the documentation:
categorical_features: “all” or array of indices or mask :
Specify what features are treated as categorical.
‘all’ (default): All features are treated as categorical.
array of indices: Array of categorical feature indices.
mask: Array of length n_features and with dtype=bool.
I'm using a mask, yet it still tries to convert to float.
Even using
ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool),
                    dtype=dt)
ohe.fit(d)
Same error.
The same happens in the "array of indices" case:
ohe = OneHotEncoder(categorical_features=np.array([1, 2]), dtype=dt)
ohe.fit(d)

All estimators in scikit-learn are designed for numerical inputs only, so there is no point in keeping the text column in that form. You have to transform the text column into something numerical, or get rid of it.
If you obtained your dataset from a Pandas DataFrame, you can take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It will help you transform all the needed columns at once (or leave some of the columns untouched in numerical form).
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})
#    number_1  number_2 text
# 0         1         2  aaa
# 1         1         2  bbb
# SomeEncoder here must be any encoder which will help you to get
# numerical representation from text column
mapper = DataFrameMapper([
    ('text', SomeEncoder),
    (['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
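If you want something concrete in place of the SomeEncoder placeholder, one possible choice (just an illustration, not the only option) is LabelBinarizer from sklearn.preprocessing, which turns the text labels into indicator columns:
from sklearn.preprocessing import LabelBinarizer
mapper = DataFrameMapper([
    ('text', LabelBinarizer()),
    (['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
# each row is now the binarized text column followed by the
# one-hot encoding of the two numeric columns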

I think there's some confusion here. You still need to pass numerical values, but within the encoder you can specify which features are categorical and which are not.
The input to this transformer should be a matrix of integers, denoting
the values taken on by categorical (discrete) features.
So in the example below I change aaa to 5 and bbb to 6, so they are distinguishable from the numerical values 1 and 2:
d = np.array([[5, 1, 1], [6, 2, 2]])
ohe = OneHotEncoder(categorical_features=np.array([True,False,False], dtype=bool))
ohe.fit(d)
Now you can check your feature categories:
ohe.active_features_
Out[22]: array([5, 6], dtype=int64)
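To see what the encoding actually produces, you can call transform on the same data; in these older scikit-learn versions the encoded categorical columns come first and the remaining numerical columns are appended after them, so the output should look roughly like this:
ohe.transform(d).toarray()
# array([[ 1.,  0.,  1.,  1.],
#        [ 0.,  1.,  2.,  2.]])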

I encountered the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is rather misleading in that regard.
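You can reproduce that very first check in isolation (a minimal sketch, assuming the pre-0.20 scikit-learn the question is about, where check_array is importable from sklearn.utils):
import numpy as np
from sklearn.utils import check_array

X = np.array([['aaa', 1, 1], ['bbb', 2, 2]], dtype=object)
check_array(X, accept_sparse='csc', dtype=np.float64)
# raises ValueError: could not convert string to float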

Related

Difference between a numpy array and a numpy vector

I wanted to know the difference between these two lines of code
X_train = training_dataset.iloc[:, 1].values
X_train = training_dataset.iloc[:, 1:2].values
My guess is that the latter is a 2-D numpy array and the former is a 1-D numpy array. For inputs to a neural network, the latter seems to be the proper form for the input data; is there a specific reason for that?
Please help!
You can check it yourself by doing this:
X_train.ndim
The first version gives a 1-D array (shape (R,), with no defined second dimension), while the second gives a 2-D array of shape (R, 1). If you want to see why that difference matters, I suggest reading this: Difference between numpy.array shape (R, 1) and (R,)
The difference is that iloc returns a Series when a single row or column is selected, but a DataFrame when a range of rows or columns is referenced.
Although they both refer to column 1, 1 and 1:2 are different types, with 1 representing an int and 1:2 representing a slice.
With,
X_train = training_dataset.iloc[:, 1].values
You specify a single column, so training_dataset.iloc[:, 1] is a Pandas Series and .values is a 1-D NumPy array.
Vs.,
X_train = training_dataset.iloc[:, 1:2].values
Although it selects just one column, 1:2 is a slice that represents a column range, so training_dataset.iloc[:, 1:2] is a Pandas DataFrame. Thus, .values is a 2-D NumPy array.
Test as follows:
Create the training_dataset DataFrame:
data = {'Height':[1, 14, 2, 1, 5], 'Width':[15, 25, 2, 20, 27]}
training_dataset = pd.DataFrame(data)
Using .iloc[:, 1]
print(type(training_dataset.iloc[:, 1]))
print(training_dataset.iloc[:, 1].values)
# Result is:
<class 'pandas.core.series.Series'>
# .values returns a 1-D NumPy array:
[15 25  2 20 27]
Using iloc[:, 1:2]
print(type(training_dataset.iloc[:, 1:2]))
print(training_dataset.iloc[:, 1:2].values)
# Result is:
<class 'pandas.core.frame.DataFrame'>
# .values is a 2-D NumPy array (since it comes from a Pandas DataFrame):
[[15]
 [25]
 [ 2]
 [20]
 [27]]
# In both cases the resulting X_train is of type <class 'numpy.ndarray'>
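If you already have the 1-D version and an estimator expects a 2-D column, reshaping gives the same (R, 1) shape as the 1:2 slice:
X_train = training_dataset.iloc[:, 1].values.reshape(-1, 1)
X_train.shape
# (5, 1)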

How Do I pass numpy array as Categorical feature in Catboost Python

I want to pass the 12th column of a numpy array as a categorical feature.
The column has int values from 1 to 10.
I tried this:
cbr.fit(X_train, y,
        eval_set=(X_train_test, y_test),
        cat_features=[X_train[:,12]],
        use_best_model=True,
        verbose=100)
But got this error:
CatboostError: 'data' is numpy array of np.float32, it means no categorical features, but 'cat_features' parameter specifies nonzero number of categorical features
Categorical features cannot be float values. The reason is that categorical features are treated as strings internally, and the same string must be obtained whether a feature value is read from a file or from a dataframe. That cannot be guaranteed for float values, but it can for strings and for integers.
To solve your problem you need to use a dataframe in which the categorical feature columns are of integer or string type.
For example,
from catboost import CatBoostClassifier, Pool
import pandas as pd
data = pd.DataFrame({'string_column': ['val0', 'val1', 'val2'],
                     'int_column': [1, 2, 3],
                     'float_column': [1.2, 2, 4.1]})
print(data)
print(data.dtypes)
train_data = Pool(
    data=data,
    label=[1, 1, -1],
    weight=[0.1, 0.2, 0.3],
    cat_features=[0, 1]
)
model = CatBoostClassifier(iterations = 10)
model.fit(X=train_data)
It is effectively impossible to use categorical features in CatBoost with a plain numpy array.
The reason is that numpy converts the whole array to a single dtype (float), while CatBoost requires categorical features to be of type int or string; mixing dtypes within one float array is not possible.
You could build a dataframe instead and ensure that its dtypes are correct.
df = df.astype(dtype={
    'cat_feature1': int,
    ...
})
From there you could do this:
df_int_list = df.select_dtypes(include='int').values.tolist()
df_no_int_list = df.select_dtypes(exclude='int').values.tolist()
df_list = []
for i, v in enumerate(df_int_list):
    df_list = df_list + [v + df_no_int_list[i]]
This works because DataFrame.values converts the frame to a numpy array and .tolist() then converts that to a plain Python list; as long as a sub-list contains only integer values, they stay integers.
cat_features = list(range(0, len(df_int_list[0])))
train_data = Pool(
    data=df_list,  # ensure your target values are removed
    label=...,  # insert your target values
    cat_features=cat_features
)
model = CatBoostClassifier()
model.fit(X=train_data)
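Applied to the original numpy array from the question, a minimal sketch of the same idea (column index 12 and the names X_train and y are taken from the question; everything else here is an assumption):
import pandas as pd
from catboost import CatBoostClassifier

X_df = pd.DataFrame(X_train)      # wrap the numpy array in a DataFrame
X_df[12] = X_df[12].astype(int)   # the categorical column must be int or string, not float
model = CatBoostClassifier(iterations=10)
model.fit(X_df, y, cat_features=[12], verbose=100)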

Sklearn Decision Tree - using sparse matrix and other features simultaneously

I am using the Sklearn Decision Tree for some classification and I have two types of data: categorical and continuous. I used pd.get_dummies for my categorical values and ended up with over 90 features, which is, of course, quite a lot.
The thing is that I then iterate over the max_features parameter to get the best score for my model, and having more than 20 features is too time-consuming. So I thought that Sklearn could use sparse matrices for my categorical features, instead of 70 columns with 0 and 1.
The question is: can Sklearn use a mix of sparse matrices and regular arrays, or not? If yes, how do I do that? Currently I get the error: setting an array element with a sequence
Here is some code to get the idea. df_with_dummies is what I currently use, but I hope there is a way to use df_with_sparse
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
a = np.random.randn(10,3)
b = np.random.random((10,1))
df = pd.DataFrame(a, columns = "A B C".split())
df['temp'] = b
df['dum1'] = np.where(df.temp < 0.5, 1, 0)
df['dum2'] = np.where(df.temp >= 0.5, 1, 0)
del df['temp']
df_with_dummies = df.copy()
a = df[['dum1', 'dum2']]
dums = csr_matrix(a)
df['dums'] = dums
df_with_sparse = df.copy()
When you do:
df['dums'] = dums
dums, being a sparse matrix, is not handled correctly by the pandas DataFrame and gets broadcast to each row. pandas does not complain about it, because it treats the sparse matrix as a non-array object.
That means each element in the df['dums'] column points to the whole sparse matrix dums. So essentially, each array element is being set with an array, hence the error setting an array element with a sequence when it is processed by scikit-learn estimators.
To combine the dense columns with the sparse matrix, you can do:
from scipy.sparse import hstack
df_with_sparse = hstack([df[['A', 'B', 'C']].values, dums])
Now you can pass this on to your estimator.
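For instance, with a made-up target y just to show the call (a sketch, not part of the answer above):
from sklearn.tree import DecisionTreeClassifier

y = np.random.randint(0, 2, size=10)   # hypothetical labels for the 10 rows
clf = DecisionTreeClassifier()
clf.fit(df_with_sparse, y)             # tree estimators accept sparse input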

how to get original data from normalized array

I have a simple piece of code, given below, which normalizes an array row-wise.
import numpy as np
from sklearn import preprocessing
X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)
X_normalized = preprocessing.normalize(X, norm='l2')
Can you please help me convert X_normalized back to X again?
You cannot recover X from nothing more than the normalized version. Consider the trivial case of several data sets, each with 2 different elements:
[3, 4]
[-18, 20]
[0, 0.0001]
Each of these standardizes (subtract the mean, divide by the standard deviation) to the same data set:
[-1, 1]
The mapping is many-to-one rather than a bijection, so it is not uniquely invertible.
However, you can recover the original set with a couple of simple techniques:
1. Keep the original data set intact (yes, it's that easy).
2. Store the normalization parameters: the mean and standard deviation (or its square, the variance). These define the linear equation that transforms each original element into a normalized element, and that equation is trivial to invert.
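For the row-wise l2 normalization from the question specifically, a minimal sketch of the second approach: preprocessing.normalize can hand back the row norms (return_norm=True), and multiplying them back in undoes the scaling:
import numpy as np
from sklearn import preprocessing

X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)
X_normalized, norms = preprocessing.normalize(X, norm='l2', return_norm=True)
X_recovered = X_normalized * norms[:, np.newaxis]   # equals X up to floating-point rounding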
All the scalers in https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing have an inverse_transform method designed for exactly that.
For example, to scale and un-scale your DataFrame with MinMaxScaler you could do:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
unscaled = scaler.inverse_transform(scaled)
Just bear in mind that the transform function (and fit_transform as well) returns a numpy.ndarray, not a pandas.DataFrame.
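If you want the result back as a DataFrame, you can wrap it yourself (assuming pandas is imported as pd and df is the frame from the snippet above):
unscaled_df = pd.DataFrame(unscaled, columns=df.columns, index=df.index)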
Reference: https://stackoverflow.com/questions/43382716/how-can-i-cleanly-normalize-data-and-then-unnormalize-it-later/43383700

Numpy: Inverse Transforming Different Size Array

I'm trying to get the hang of normalizing my data, doing some work on it, and then changing it back. When doing an inverse_transform, do I always have to pass in exactly the same shape as when I did the fit_transform? The code below gives me "non-broadcastable output operand with shape (3,1) doesn't match the broadcast shape (3,3)"
import numpy as np
from sklearn.preprocessing import MinMaxScaler
first = np.array([[1.2345,  1.220000, 1.26245],
                  [1.234,   1.220000, 7.0901],
                  [1.23450, 1.22000,  1.14795]])
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(first)
new_dataset = dataset[:,:1]
trainPredict2 = scaler.inverse_transform(new_dataset)
You don't have to pass a data set with exactly the same shape, but the number of columns must match the original data set, since each row is interpreted as a record and each column as a feature, and you cannot drop features from the data you pass in. So, for instance, slicing rows will still work:
new_dataset = dataset[:1,:]
trainPredict2 = scaler.inverse_transform(new_dataset)
This gives back the first row of your original data set:
trainPredict2
# array([[ 1.2345 , 1.22 , 1.26245]])
If you really want to invert just one or two features, you can compute it by hand by inverting the min-max formula x' := (x − x_min) / (x_max − x_min):
scaler.data_range_[:1] * dataset[:,:1] + scaler.data_min_[:1]
# array([[ 1.2345],
# [ 1.234 ],
# [ 1.2345]])
This gives back the first column of your original data set.
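A common alternative (a sketch, not part of the answer above): pad the single column back into an array of the original width, run inverse_transform on that, and keep only the column you care about:
padded = np.zeros_like(dataset)   # same shape the scaler was fitted on
padded[:, :1] = new_dataset       # put the column back in its original position
scaler.inverse_transform(padded)[:, :1]
# array([[ 1.2345],
#        [ 1.234 ],
#        [ 1.2345]])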
