I'm trying to get the hang of normalizing my data, doing some work on it, and then changing it back. When calling inverse_transform, do I always have to pass in data of exactly the same shape as what I passed to fit_transform? The code below gives me "non-broadcastable output operand with shape (3,1) doesn't match the broadcast shape (3,3)":
import numpy as np
from sklearn.preprocessing import MinMaxScaler
first = np.array([[1.2345,  1.220000, 1.26245],
                  [1.234,   1.220000, 7.0901],
                  [1.23450, 1.22000,  1.14795]])
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(first)
new_dataset = dataset[:,:1]
trainPredict2 = scaler.inverse_transform(new_dataset)
You don't have to pass a data set with exactly the same shape, but the number of columns must match the original data set, since each row is interpreted as a record and each column as a feature. Technically, you cannot drop features from the data set you want to inverse-transform. Slicing rows, for instance, still works:
new_dataset = dataset[:1,:]
trainPredict2 = scaler.inverse_transform(new_dataset)
This gives back the first row of your original data set:
trainPredict2
# array([[ 1.2345 , 1.22 , 1.26245]])
If you really want to invert only one or two features, you can do it yourself by inverting the min-max transformation x' = (x - x_min) / (x_max - x_min), i.e. x = x' * (x_max - x_min) + x_min:
scaler.data_range_[:1] * dataset[:,:1] + scaler.data_min_[:1]
# array([[ 1.2345],
# [ 1.234 ],
# [ 1.2345]])
This gives back the first column of your original data set.
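If you prefer to stay with inverse_transform, another common workaround is to pad the single column back into an array with the original number of columns; the extra columns are placeholders and only the first column of the result is meaningful (a sketch using the names defined above):
# pad the transformed column into a full-width array so inverse_transform accepts it
padded = np.zeros_like(dataset)
padded[:, :1] = dataset[:, :1]
scaler.inverse_transform(padded)[:, :1]
# array([[ 1.2345],
#        [ 1.234 ],
#        [ 1.2345]])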
I have a large (n x dim) array, each row is a vector in a space (whatever the dimension but let's do it in 2D):
import numpy as np
A = np.array([[50,14],[26,11],[81,9],[-11,-19]])
A.shape
(4,2)
I want to quickly compute the unit vector for each of those rows.
N = np.linalg.norm(A, axis=1)
# something like this, but for each row:
A /= N # not working:
# ValueError: operands could not be broadcast together
# with shapes (4,2) (4,) (4,2)
# or in a pandas-like manner:
np.divide(A, N, axis=1, inplace=True) # not working either
How could you do that properly?
You can use a broadcasting operation such as:
A /= np.linalg.norm(A, axis=1)[:,None]
# or
A /= np.linalg.norm(A, axis=1).reshape(4,1)
both of which give the norm array a shape of (4,1) instead of (4,), so it broadcasts correctly against A's shape (4,2).
But beware: A.dtype should be float64*, otherwise the in-place division (a ufunc) will raise this error:
A /= np.linalg.norm(A, axis=1)[:,None]
TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to
provided output parameter (typecode 'l') according to the casting
rule ''same_kind''
But doing it as follows will work, no matter the value of A.dtype:
A = A/np.linalg.norm(A, axis=1)[:,None]
* To initialize the array with float64 you can simply add a decimal point to one of the numbers:
A = np.array([[50.,14],[26,11],[81,9],[-11,-19]])
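As an aside, np.linalg.norm also accepts keepdims=True, which avoids the manual reshaping; a minimal sketch of the same normalization:
# keepdims=True keeps the reduced axis, so the norms come out with shape (4,1)
A = A / np.linalg.norm(A, axis=1, keepdims=True)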
You can also use the normalize function from scikit-learn's preprocessing toolbox:
import sklearn
sklearn.__version__ # 0.24.2
from sklearn.preprocessing import normalize
normalize(A, norm="l2", axis=1)
array([[ 0.962964 , 0.2696299],
[ 0.9209673, 0.38964 ],
[ 0.9938837, 0.1104315],
[-0.5010363, -0.8654263]])
# as per the doc, you can set the copy flag to False to perform inplace row
# normalization and avoid a copy (if the input is already a numpy array or a
# scipy.sparse CSR matrix and if axis is 1):
normalize(A, norm="l2", axis=1, copy=False)
array([[ 0.962964 , 0.2696299],
[ 0.9209673, 0.38964 ],
[ 0.9938837, 0.1104315],
[-0.5010363, -0.8654263]])
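As a quick sanity check, the rows of the result have unit L2 norm:
np.linalg.norm(normalize(A, norm="l2", axis=1), axis=1)
# array([1., 1., 1., 1.])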
I have a task to create a 30x40 feature matrix with random integers between 1 & 100:
import numpy as np
matrix= np.random.randint(1,100,size=(30,40))
Next I need to rescale the elements in the matrix to the range 5-10:
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
scaler.fit (5,10)
matrix1 = scaler.fit_transform(matrix)
Which gives me this error:
ValueError: Expected 2D array, got scalar array instead:
array=5.0.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample
I've tried reshaping the data:
matrix.reshape(-1,1)
but I get the same error.
I think you need to define the feature range when you create an instance of MinMaxScaler like this:
scaler = preprocessing.MinMaxScaler(feature_range=(5, 10))
And then you could fit and transform the data like this:
matrix1 = scaler.fit_transform(matrix)
The last line is a short form for:
scaler.fit(matrix)
matrix1 = scaler.transform(matrix)
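Putting it together, a minimal end-to-end sketch (the final print is just a sanity check that the columns were mapped into [5, 10]):
import numpy as np
from sklearn import preprocessing

matrix = np.random.randint(1, 100, size=(30, 40))
scaler = preprocessing.MinMaxScaler(feature_range=(5, 10))
matrix1 = scaler.fit_transform(matrix)
print(matrix1.min(), matrix1.max())  # expected: 5.0 10.0 (each column's min/max map exactly to the range ends)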
I am using Sklearn Decision Tree for some classification and I have two types of data: categorical and continuous. I used pd.get_dummies for my categorical values and ended up with over 90 features. Which is, of course, quite a lot.
The thing is that I then iterate over the max_features parameter to get the best score for my model, and having more than 20 features is too time-consuming. So I thought that Sklearn could use sparse matrices for my categorical features, instead of 70 columns of 0s and 1s.
The question is: can Sklearn use a mix of sparse matrices and regular arrays, or not? If yes, how do I do that? Currently I get the error: setting an array element with a sequence.
Here is some code to get the idea. df_with_dummies is what I currently use, but I hope there is a way to use df_with_sparse:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
a = np.random.randn(10,3)
b = np.random.random((10,1))
df = pd.DataFrame(a, columns = "A B C".split())
df['temp'] = b
df['dum1'] = np.where(df.temp < 0.5, 1, 0)
df['dum2'] = np.where(df.temp >= 0.5, 1, 0)
del df['temp']
df_with_dummies = df.copy()
a = df[['dum1', 'dum2']]
dums = csr_matrix(a)
df['dums'] = dums
df_with_sparse = df.copy()
When you do:
df['dums'] = dums
dums, being a sparse matrix, is not handled correctly by the pandas DataFrame, and it gets broadcast to each row. pandas does not complain about it because it treats the sparse matrix as a non-array object.
That means each element in the df['dums'] column points to the whole sparse matrix dums. So essentially each array element is being set with a sequence, hence the error setting an array element with a sequence when it is processed by scikit-learn estimators.
To combine the dense and sparse parts properly, you can use scipy.sparse.hstack instead:
from scipy.sparse import hstack
df_with_sparse = hstack([df[['A', 'B', 'C']].values, dums])
Now you can pass this further.
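From here, any estimator that accepts sparse input should work; a minimal sketch with a made-up target y, just to show that the stacked matrix is accepted (the classifier choice is illustrative, not from the original question):
from sklearn.tree import DecisionTreeClassifier

# y is an invented binary target, purely for illustration
y = np.random.randint(0, 2, size=10)
clf = DecisionTreeClassifier()
clf.fit(df_with_sparse.tocsr(), y)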
Considering data like:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
dt = 'object, i4, i4'
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)
I want to exclude the text column using the OHE functionality.
Why does the following not work?
ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool))
ohe.fit(d)
ValueError: could not convert string to float: 'bbb'
It says in the documentation:
categorical_features: “all” or array of indices or mask :
Specify what features are treated as categorical.
‘all’ (default): All features are treated as categorical.
array of indices: Array of categorical feature indices.
mask: Array of length n_features and with dtype=bool.
I'm using a mask, yet it still tries to convert to float.
Even using
ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool),
dtype=dt)
ohe.fit(d)
Same error.
And also in the case of "array of indices":
ohe = OneHotEncoder(categorical_features=np.array([1, 2]), dtype=dt)
ohe.fit(d)
You should understand that all estimators in scikit-learn are designed for numerical inputs only. From that point of view there is no sense in keeping the text column in this form: you have to transform that text column into something numerical, or get rid of it.
If you obtained your dataset from a Pandas DataFrame, you can take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It will help you transform all the needed columns simultaneously (or leave some of them in their numerical form):
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})
# number_1 number_2 text
# 0 1 2 aaa
# 1 1 2 bbb
# SomeEncoder here must be any encoder which will help you to get
# numerical representation from text column
mapper = DataFrameMapper([
    ('text', SomeEncoder),
    (['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
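For instance, LabelBinarizer is one encoder that could fill the SomeEncoder slot here; a sketch, assuming sklearn_pandas is installed and the DataFrame data from above:
from sklearn.preprocessing import LabelBinarizer

mapper = DataFrameMapper([
    ('text', LabelBinarizer()),                   # binarizes the text column
    (['number_1', 'number_2'], OneHotEncoder())   # one-hot encodes the numeric columns
])
mapper.fit_transform(data)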
I think there's some confusion here. You still need to pass numerical values, but within the encoder you can specify which features are categorical and which are not.
The input to this transformer should be a matrix of integers, denoting
the values taken on by categorical (discrete) features.
So in the example below I change aaa to 5 and bbb to 6, so that they are distinguished from the 1 and 2 numerical values:
d = np.array([[5, 1, 1], [6, 2, 2]])
ohe = OneHotEncoder(categorical_features=np.array([True,False,False], dtype=bool))
ohe.fit(d)
Now you can check your feature categories:
ohe.active_features_
Out[22]: array([5, 6], dtype=int64)
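In older scikit-learn versions that still support categorical_features, the transform then one-hot encodes only the first column and passes the other two through; roughly (output layout assumed, not from the original answer):
ohe.transform(d).toarray()
# expected to look something like:
# array([[ 1.,  0.,  1.,  1.],
#        [ 0.,  1.,  2.,  2.]])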
I encountered the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is rather misleading in that regard.
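You can reproduce that check in isolation; a small sketch of what happens when string data reaches check_array (the exact error text may vary between versions):
from sklearn.utils import check_array
import numpy as np

# any string in the array makes the float conversion fail
check_array(np.array([['aaa', 1, 1], ['bbb', 2, 2]], dtype=object))
# ValueError: could not convert string to float: 'aaa'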
I had a pandas dataframe whose column names are the strings '0' through '9':
working_df = pd.DataFrame(np.random.rand(5,10),index=range(0,5), columns=[str(x) for x in range(10)])
working_df.loc[:,'outcome'] = [0,1,1,0,1]
I then wanted to get an array of all of these numbers into one column so I did:
array_list = [Y for Y in working_df[[str(num) for num in range(10)]].values]
which gave me:
[array([ 0.0793451 , 0.3288617 , 0.75887129, 0.01128641, 0.64105905,
0.78789297, 0.69673768, 0.20354558, 0.48976411, 0.72848541]),
array([ 0.53511388, 0.08896322, 0.10302786, 0.08008444, 0.18218731,
0.2342337 , 0.52622153, 0.65607384, 0.86069294, 0.8864577 ]),
array([ 0.82878026, 0.33986175, 0.25707122, 0.96525733, 0.5897311 ,
0.3884232 , 0.10943644, 0.26944414, 0.85491211, 0.15801284]),
array([ 0.31818888, 0.0525836 , 0.49150727, 0.53682492, 0.78692193,
0.97945708, 0.53181293, 0.74330327, 0.91364064, 0.49085287]),
array([ 0.14909577, 0.33959452, 0.20607263, 0.78789116, 0.41780657,
0.0437907 , 0.67697385, 0.98579928, 0.1487507 , 0.41682309])]
I then attached it to my dataframe using:
working_df.loc[:,'array_list'] = pd.Series(array_list)
I then set up rf_clf = RandomForestClassifier() and try rf_clf.fit(working_df['array_list'][1:].values, working_df['outcome'][1:].values), which results in ValueError: setting an array element with a sequence.
Is it a problem with the array of arrays in the fitting? Thanks for any insight.
The problem is that scikit-learn expects a two-dimensional array of values as input. You're passing a one-dimensional array of objects (with each object itself being a one-dimensional array).
A quick fix would be to do this:
X = np.array(list(working_df['array_list'][1:]))
y = working_df['outcome'][1:].values
rf_clf.fit(X, y)
A better fix would be to not store your two-dimensional feature array within a one-dimensional pandas column.
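If you go that route, a minimal sketch of the "better fix" is to keep the ten feature columns as they are and slice them directly (column names as in the question):
# select the original feature columns instead of the packed object column
X = working_df[[str(num) for num in range(10)]].values[1:]   # shape (4, 10)
y = working_df['outcome'].values[1:]
rf_clf.fit(X, y)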