I tried to use categories instead of categorical_features, but it did not help.
Please help with the error:
Traceback (most recent call last): File "test.py", line 28, in <module> onehotencoder = OneHotEncoder(categorical_features=[0]) TypeError: __init__() got an unexpected keyword argument 'categorical_features'
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder_X=LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0]) #Encoding the values of column Country
onehotencoder=OneHotEncoder(categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()
print(X)
Acc to the documentation their is not any attribute 'categorical_features'
categories‘auto’ or a list of array-like, default=’auto’
Categories (unique values) per feature:
‘auto’ : Determine categories automatically from the training data.
list : categories[i] holds the categories expected in the ith column. The passed >categories should not mix strings and numeric values within a single feature, and should >be sorted in case of numeric values.
The used categories can be found in the categories_ attribute.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
...
X = ...
# Encoding the values of column Country
onehotencoder = OneHotEncoder(sparse=False)
X = np.concatenate(
onehotencoder.fit_transform(X[:,0:1]),
X[:,1:],
axis=1
)
print(X)
# Do this to show what categories are collected and encoded.
print(onehotencoder.categories_)
The old version of scikit-learn did all the projection, encoding and re-merging, sinking non-categorical columns to the right. It doesn't support it now, so we instead extract the Country column manually, pass it through the encoder, and concatenate them together.
OneHotEncoder by default returns "sparse" arrays, and we can avoid converting them into numpy arrays by specifying sparse=False.
Note that LabelEncoder is redundant since OneHotEncoder can automatically fit to string values anyway (at least in recent versions).
Related
I used FeatureAgglomeration to cluster my 105x105 dataframe into 40 clusters based on Spearman. Now I want to get the output feature names using feature_names_in and get_feature_names_out, but it does not seem to work, and I cannot find the solution anymore. This is my code:
import pandas as pd
import numpy as np
from sklearn.cluster import FeatureAgglomeration
features = np.array([...])
print(features.shape)
>>> (105,)
Class1_rank=pd.read_excel(r'H:\PycharmProjects\RadiomicsPipeline\Class1_rank.xlsx')
print(Class1_rank)
>>> original_shape_Elongation ... original_ngtdm_Strength
original_shape_Elongation 1.000000 ... -0.054310
original_shape_Flatness 0.616327 ... -0.019544
original_shape_LeastAxisLength 0.271645 ... -0.293157
>>> [105 rows x 105 columns]
print(agglo.n_features_in_)
>>> 105
print(agglo.feature_names_in_(Class1_rank))
print(agglo.get_feature_names_out())
df_reduced = agglo.transform(Class1)
At print(agglo.feature_names_in_()) I get to following error:
TypeError: 'numpy.ndarray' object is not callable
However, Class1_rank is a DataFrame, and thus should not give that error? What I am doing wrong here?
What I have tried:
Comment print(agglo.feature_names_in_(Class1_rank)). Works, but then print(agglo.get features out) gives the following result, and not the names of the features I included.
['featureagglomeration0' 'featureagglomeration1' 'featureagglomeration2' 'featureagglomeration3' 'featureagglomeration4'....]
Use features as input for both functions, gives the same error.
Insert the features as strings for Class1_rank, gives the same error.
feature_names_in_ is an array, not a callable, so agglo.feature_names_in_ is correct, but parentheses after it (empty or not) is incorrect.
get_feature_names_out() gives names for each cluster, which are not in 1-1 correspondence with input features, so it cannot give you something like the original feature names. You can use the labels_ attribute to find which input features go into which output features, see e.g. this answer.
I am using the Beers dataset in which I want to encode the data with datatype 'object'.
Following is my code.
from sklearn import preprocessing
df3 = BeerDF.select_dtypes(include=['object']).copy()
label_encoder = preprocessing.LabelEncoder()
df3 = df3.apply(label_encoder.fit_transform)
The following error is occurring.
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
Any insights are helpful!!!
Use:
df3 = df3.astype(str).apply(label_encoder.fit_transform)
From the TypeError, it seems that the column you want to transform into label has two different data types (dtypes), in your case string and float which raises an error. To avoid this, you have to modify your column to have a uniform dtype (string or float). For example in Iris classification dataset, class=['Setosa', 'Versicolour', 'Virginica'] and not ['Setosa', 3, 5, 'Versicolour']
Image of ull error
I am trying to run LabelEncoder on all columns that are of type object. This is the code I wrote but it throws this error:
TypeError: '<' not supported between instances of 'int' and 'str'
Does anybody know how to fix this?
le=LabelEncoder()
for col in X_test.columns.values:
if X_test[col].dtypes=='object':
data=X_train[col].append(X_test[col])
le.fit(data.values)
X_train[col]=le.transform(X_train[col])
X_test[col]=le.transform(X_test[col])
Looks like it has different types while appending. You try converting all to str at fit method:
le.fit(data.values.astype(str))
And you have to change your data type to str for transform as well since the classes in LabelEncoder will be str:
X_train[col]=le.transform(X_train[col].astype(str))
X_test[col]=le.transform(X_test[col].astype(str))
Trying to recreate similar problem. If dataframe has values with int and str:
import pandas as pd
df = pd.DataFrame({'col1':["tokyo", 1 , "paris"]})
print(df)
Result:
col1
0 tokyo
1 1
2 paris
Now, using Labelenconder would give similar error message i.e. TypeError: unorderable types: int() < str() :
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.col1.values)
Converting all to str in fit or before may resolve issue:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.col1.values.astype(str))
print(le.classes_)
Result:
['1' 'paris' 'tokyo']
If you just call le.transform(df.col1), it will throw similar error again.
So, it has to be le.transform(df.col1.astype(str)) instead.
The error is basically telling you the exact problem: some of the values are strings and some are not. You can solve this by calling c.astype(str) each time you call fit, fit_transform, or transform, on Series c, e.g.:
le.fit(data.values.astype(str))
Let's consider the dataset of House prices from this example.
I have the entire dataset stored in the housing variable:
housing.shape
(20640, 10)
I also have done a OneHotEncoder encoding of one dimensions and get housing_cat_1hot, so
housing_cat_1hot.toarray().shape
(20640, 5)
My target is to join the two variables and store everything in just one dataset.
I have tried the Join with index tutorial but the problem is that the second matrix haven't any index.
How can I do a JOIN between housing and housing_cat_1hot?
>>> left=housing
>>> right=housing_cat_1hot.toarray()
>>> result = left.join(right)
Traceback (most recent call last): File "", line 1, in
result = left.join(right) File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py",
line 5293, in join
rsuffix=rsuffix, sort=sort) File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py",
line 5323, in _join_compat
can_concat = all(df.index.is_unique for df in frames) File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py",
line 5323, in
can_concat = all(df.index.is_unique for df in frames) AttributeError: 'numpy.ndarray' object has no attribute 'index'
Well, depends on how you created the one-hot vector.
But if it's sorted the same as your original DataFrame, and itself is a DataFrame, you can add the same index before joining:
housing_cat_1hot.index = range(len(housing_cat_1hot))
And if it's not a DataFrame, convert it to one.
This is simple, as long as both objects are sorted the same
Edit: If it's not a DataFrame, then:
housing_cat_1hot = pd.DataFrame(housing_cat_1hot)
Already creates the proper index for you
If you wish to join the two arrays (assuming both housing_cat_1hot and housing are arrays), you can use
housing = np.hstack((housing, housing_cat_1hot))
Though the best way to OneHotEncode a variable is selecting that variable within the array and encode. It saves you the trouble of joining the two later
Say the index of the variable you wish to encode in your array is 1,
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
X[:, 1] = le.fit_transform(X[:, 1])
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
Thanks to #Elez-Shenhar answer I get the following working code:
OneHot=housing_cat_1hot.toarray()
OneHot= pd.DataFrame(OneHot)
result = housing.join(OneHot)
result.shape
(20640, 15)
I am using python with pandas and sklearn and trying to use the new and very convenient sklearn-pandas.
I have a big data frame and need to transform multiple columns in a similar way.
I have multiple column names in the variable other
the source code documentation here
states explicitly there is a possibility of transforming multiple columns with the same transformation, but the following code does not behave as expected:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
mapper = DataFrameMapper([[other[0],other[1]],LabelEncoder()])
mapper.fit_transform(df.copy())
I get the following error:
raise ValueError("bad input shape {0}".format(shape))
ValueError: ['EFW', 'BPD']: bad input shape (154, 2)
When I use the following code, it works great:
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)]
mapper = DataFrameMapper(cols)
mapper.fit_transform(df.copy())
To my understanding, both should work well and yield same results.
What am I doing wrong here?
Thanks!
The problem you encounter here, is that the two snippets of code are completely different in terms of data structure.
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)] builds a list of tuples. Do note that you can shorten this line of code to:
cols = [(col, LabelEncoder()) for col in other]
Anyway, the first snippet, [[other[0],other[1]],LabelEncoder()] results in a list containing two elements: a list and a LabelEncoder instance. Now, it is documented that you can transform multiple columns through specifying:
Transformations may require multiple input columns. In these cases, the column names can be specified in a list:
mapper2 = DataFrameMapper([
(['children', 'salary'], sklearn.decomposition.PCA(1))
])
This is a list containing tuple(list, object) structured elements, not list[list, object] structured elements.
If we take a look at the source code itself,
class DataFrameMapper(BaseEstimator, TransformerMixin):
"""
Map Pandas data frame column subsets to their own
sklearn transformation.
"""
def __init__(self, features, default=False, sparse=False, df_out=False,
input_df=False):
"""
Params:
features a list of tuples with features definitions.
The first element is the pandas column selector. This can
be a string (for one column) or a list of strings.
The second element is an object that supports
sklearn's transform interface, or a list of such objects.
The third element is optional and, if present, must be
a dictionary with the options to apply to the
transformation. Example: {'alias': 'day_of_week'}
It is also clearly stated in the class definition that the features argument to DataFrameMapper is required to be a list of tuples, where the elements of the tuple may be lists.
As a last note, as to why you actually get your error message: The LabelEncoder transformer in sklearn is meant for labeling purposes on 1D arrays. As such, it is fundamentally unable to handle 2 columns at once and will raise an Exception. So, if you want to use the LabelEncoder, you will have to build N tuples with 1 columnname and the transformer where N is the amount of columns you wish to transform.