I am using the Beers dataset, and I want to label-encode the columns with dtype 'object'. Following is my code:
from sklearn import preprocessing
df3 = BeerDF.select_dtypes(include=['object']).copy()
label_encoder = preprocessing.LabelEncoder()
df3 = df3.apply(label_encoder.fit_transform)
The following error occurs:
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
Any insights are helpful!!!
Use:
df3 = df3.astype(str).apply(label_encoder.fit_transform)
From the TypeError, it seems that the column you want to encode contains two different data types (dtypes), in your case str and float, which raises the error. To avoid this, convert the column to a uniform dtype (all strings or all numbers). For example, in the Iris classification dataset the class column is ['Setosa', 'Versicolour', 'Virginica'], not ['Setosa', 3, 5, 'Versicolour'].
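As a minimal sketch (with a hypothetical mixed-type column, since the Beers data isn't shown here), casting to str before encoding makes the dtype uniform:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column mixing str and float, as in the error message.
df3 = pd.DataFrame({'style': ['IPA', 'Lager', 4.5, 'IPA']})

le = LabelEncoder()
# le.fit_transform(df3['style'])  # would raise the TypeError above

# Casting to str first gives a uniform dtype, so encoding succeeds.
encoded = le.fit_transform(df3['style'].astype(str))
print(list(le.classes_))  # ['4.5', 'IPA', 'Lager']
print(encoded.tolist())   # [1, 2, 0, 1]
```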
I'm trying to replace the categorical variable in the Gender column - M, F with 0, 1. However, after running my code I'm getting NaN in place of 0 & 1.
Code-
df['Gender'] = df['Gender'].map({'F':1, 'M':0})
My input data frame and the data frame after running the code (the Gender column now shows NaN) were attached as screenshots.
Details: the Gender column's dtype is object.
Kindly, suggest a way out!
Maybe the values in your dataframe differ from the expected strings 'F' and 'M' (for example, extra whitespace); .map() returns NaN for any value not found in the mapping. Try LabelEncoder from scikit-learn:
from sklearn.preprocessing import LabelEncoder
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])
This particular code resolved the issue:
# Import label encoder
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Gender'.
df['Gender']= label_encoder.fit_transform(df['Gender'])
df['Gender'].unique()
I converted my dataset features into integers using the following code:
import pandas as pd

car_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 3, 4, 5],
                       'Categorical Feature': ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']})
This worked. Now, I am trying to create a decision tree and used the following code:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(car_df, y)
However, I get an error stating: ValueError: could not convert string to float: 'buying'
'Buying' is the first categorical feature in the dataset. There are six categorical features.
I thought that would not have been an issue since I converted the features to integers. Does anyone have an idea of how to fix this?
I just pulled this cars dataset so I have a better idea of its contents. Based on the documentation, here are the columns with possible values:
buying v-high, high, med, low
maint v-high, high, med, low
doors 2, 3, 4, 5-more
persons 2, 4, more
lug_boot small, med, big
safety low, med, high
So all of these columns can contain strings and they all need to be converted to numeric type before you can pass the dataset to your model's fit() method.
One straightforward fix is pandas' get_dummies() method (documented at https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html). Once you have your original dataset in a dataframe (call it df), you can pass it to pd.get_dummies() like this:
import pandas as pd
df_with_dummies = pd.get_dummies(df)
This call converts every column with object or category dtype into dummy (indicator) columns and names each new column using the {original column name}_{original value} convention.
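For instance (a toy frame, not the real cars data), the naming convention looks like this:

```python
import pandas as pd

# Toy frame with two categorical columns (not the real cars data).
df = pd.DataFrame({'doors': ['2', '4'], 'safety': ['low', 'high']})
df_with_dummies = pd.get_dummies(df)
print(list(df_with_dummies.columns))
# ['doors_2', 'doors_4', 'safety_high', 'safety_low']
```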
I tried to use categories instead of categorical_features, but it did not help.
Please help with the error:
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    onehotencoder = OneHotEncoder(categorical_features=[0])
TypeError: __init__() got an unexpected keyword argument 'categorical_features'
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])  # Encoding the values of column Country
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
print(X)
According to the documentation, there is no 'categorical_features' parameter any more (it was removed in newer scikit-learn versions). The relevant parameter now is 'categories':
categories : 'auto' or a list of array-like, default='auto'
    Categories (unique values) per feature:
    - 'auto' : Determine categories automatically from the training data.
    - list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
    The used categories can be found in the categories_ attribute.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
...
X = ...
# Encoding the values of column Country
onehotencoder = OneHotEncoder(sparse=False)
X = np.concatenate(
    [onehotencoder.fit_transform(X[:, 0:1]),
     X[:, 1:]],
    axis=1
)
print(X)

# Show which categories were collected and encoded.
print(onehotencoder.categories_)
Older versions of scikit-learn did the projection, encoding, and re-merging themselves, moving non-categorical columns to the right. Current versions no longer do this, so we instead extract the Country column manually, pass it through the encoder, and concatenate the results ourselves (note that np.concatenate takes a sequence of arrays as its first argument).
OneHotEncoder returns a sparse matrix by default; passing sparse=False (renamed sparse_output in scikit-learn 1.2) avoids having to call .toarray() ourselves.
Note that LabelEncoder is redundant since OneHotEncoder can automatically fit to string values anyway (at least in recent versions).
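To illustrate (with a made-up Country column), the encoder accepts strings directly; .toarray() is used here so the sketch works across scikit-learn versions regardless of the sparse/sparse_output parameter name:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up Country column, as a 2-D array, which OneHotEncoder expects.
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])

enc = OneHotEncoder()
encoded = enc.fit_transform(countries).toarray()
print(enc.categories_)  # categories are sorted: France, Germany, Spain
print(encoded[0])       # [1. 0. 0.] -> France
```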
I am trying to encode my dataframe, which contains string columns, but I am receiving this error:
'<' not supported between instances of 'str' and 'NoneType'",
'occurred at index ProductFabric'
CODE:
from sklearn import preprocessing
df1=df1.apply(preprocessing.LabelEncoder().fit_transform)
Here is an example from the sklearn documentation that may help. In your case, however, you are applying the encoder to a DataFrame that may have multiple columns, and some of them may contain null values (None), which LabelEncoder cannot sort against strings:
from sklearn import preprocessing
df = [1, 1, 2, 6]
le = preprocessing.LabelEncoder().fit_transform(df)
print(le)
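Building on the null-value point above, here is a sketch (with a hypothetical ProductFabric column, since the real df1 isn't shown) that fills missing values with a sentinel string before encoding:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with a None value, like the ProductFabric column in the question.
df1 = pd.DataFrame({'ProductFabric': ['cotton', None, 'silk', 'cotton']})

# Replace missing values with a sentinel string so every value is comparable.
df1 = df1.fillna('missing')
df1 = df1.apply(LabelEncoder().fit_transform)
print(df1['ProductFabric'].tolist())  # [0, 1, 2, 0]
```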
I am trying to run LabelEncoder on all columns that are of type object. This is the code I wrote but it throws this error:
TypeError: '<' not supported between instances of 'int' and 'str'
Does anybody know how to fix this?
le = LabelEncoder()
for col in X_test.columns.values:
    if X_test[col].dtypes == 'object':
        data = X_train[col].append(X_test[col])
        le.fit(data.values)
        X_train[col] = le.transform(X_train[col])
        X_test[col] = le.transform(X_test[col])
It looks like the appended data contains mixed types. Try converting everything to str in the fit call:
le.fit(data.values.astype(str))
You also have to cast to str for transform, since the classes stored in the LabelEncoder will be str:
X_train[col] = le.transform(X_train[col].astype(str))
X_test[col] = le.transform(X_test[col].astype(str))
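Putting the pieces together as a runnable sketch (with made-up train/test frames; note that Series.append was removed in pandas 2.0, so pd.concat is used instead):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up train/test frames; the train column mixes int and str.
X_train = pd.DataFrame({'city': ['tokyo', 1, 'paris']})
X_test = pd.DataFrame({'city': ['paris', 'tokyo']})

le = LabelEncoder()
for col in X_test.columns.values:
    if X_test[col].dtype == 'object':
        # pd.concat replaces the removed Series.append; cast to str for a uniform dtype.
        data = pd.concat([X_train[col], X_test[col]]).astype(str)
        le.fit(data.values)
        X_train[col] = le.transform(X_train[col].astype(str))
        X_test[col] = le.transform(X_test[col].astype(str))

print(list(le.classes_))        # ['1', 'paris', 'tokyo']
print(X_test['city'].tolist())  # [1, 2]
```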
Trying to recreate a similar problem. If the dataframe has both int and str values:
import pandas as pd
df = pd.DataFrame({'col1':["tokyo", 1 , "paris"]})
print(df)
Result:
col1
0 tokyo
1 1
2 paris
Now, using LabelEncoder gives a similar error message, i.e. TypeError: unorderable types: int() < str():
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.col1.values)
Converting all to str in fit or before may resolve issue:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.col1.values.astype(str))
print(le.classes_)
Result:
['1' 'paris' 'tokyo']
If you then call le.transform(df.col1) directly, it will throw a similar error again, so call le.transform(df.col1.astype(str)) instead.
The error is basically telling you the exact problem: some of the values are strings and some are not. You can solve this by calling c.astype(str) each time you call fit, fit_transform, or transform, on Series c, e.g.:
le.fit(data.values.astype(str))