Running sklearn's LabelEncoder on all columns at once - python

I am trying to run LabelEncoder on all columns that are of type object. This is the code I wrote but it throws this error:
TypeError: '<' not supported between instances of 'int' and 'str'
Does anybody know how to fix this?
le=LabelEncoder()
for col in X_test.columns.values:
    if X_test[col].dtypes=='object':
        data=X_train[col].append(X_test[col])
        le.fit(data.values)
        X_train[col]=le.transform(X_train[col])
        X_test[col]=le.transform(X_test[col])

It looks like the appended column contains values of different types. Try converting everything to str when calling fit:
le.fit(data.values.astype(str))
And you have to change your data type to str for transform as well since the classes in LabelEncoder will be str:
X_train[col]=le.transform(X_train[col].astype(str))
X_test[col]=le.transform(X_test[col].astype(str))
To recreate a similar problem, take a dataframe whose column holds both int and str values:
import pandas as pd
df = pd.DataFrame({'col1':["tokyo", 1 , "paris"]})
print(df)
Result:
    col1
0  tokyo
1      1
2  paris
Now, using LabelEncoder gives a similar error message, i.e. TypeError: unorderable types: int() < str():
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.col1.values)
Converting all values to str in fit, or beforehand, may resolve the issue:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.col1.values.astype(str))
print(le.classes_)
Result:
['1' 'paris' 'tokyo']
If you just call le.transform(df.col1), it will throw a similar error again, because the fitted classes are now strings while the column still holds mixed types.
So, it has to be le.transform(df.col1.astype(str)) instead.

The error is basically telling you the exact problem: some of the values are strings and some are not. You can solve this by calling c.astype(str) each time you call fit, fit_transform, or transform, on Series c, e.g.:
le.fit(data.values.astype(str))
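Putting the answers above together, the whole loop could look roughly like the sketch below. The two small frames and the `encoders` dictionary are made up for illustration. `pd.concat` stands in for `Series.append`, which was removed in pandas 2.0, and one encoder is kept per column so that each column's classes remain available afterwards.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: "city" mixes int and str, which triggers the TypeError.
X_train = pd.DataFrame({"city": ["tokyo", 1, "paris"], "size": [1.0, 2.0, 3.0]})
X_test = pd.DataFrame({"city": ["paris", "tokyo", 1], "size": [4.0, 5.0, 6.0]})

encoders = {}  # one fitted encoder per object column (illustrative name)
for col in X_train.columns:
    if X_train[col].dtype == object or X_test[col].dtype == object:
        # Fit on train and test together, with every value cast to str.
        combined = pd.concat([X_train[col], X_test[col]]).astype(str)
        le = LabelEncoder().fit(combined)
        # Cast to str again before transform, since the classes are strings.
        X_train[col] = le.transform(X_train[col].astype(str))
        X_test[col] = le.transform(X_test[col].astype(str))
        encoders[col] = le

print(encoders["city"].classes_)
print(X_train)
```

Keeping the encoders around also makes it possible to call `inverse_transform` later, although the recovered values will be strings rather than the original mixed types.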

Related

Getting NaN in a column after applying map() function

I'm trying to replace the categorical variable in the Gender column - M, F with 0, 1. However, after running my code I'm getting NaN in place of 0 & 1.
Code-
df['Gender'] = df['Gender'].map({'F':1, 'M':0})
My input data frame-
Dataframe after running the code-
Details- Gender (Data Type) - object
Kindly, suggest a way out!
Maybe the values in your dataframe differ from the expected strings 'F' and 'M'. Try using LabelEncoder from scikit-learn.
from sklearn.preprocessing import LabelEncoder
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])
This particular code resolved the issue-
# Import label encoder
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Gender'.
df['Gender']= label_encoder.fit_transform(df['Gender'])
df['Gender'].unique()
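To see why map() can return NaN, here is a small sketch. It assumes the cause is stray whitespace or inconsistent case in the column, which is only one possibility. The sample values are invented. Any value that is not an exact key of the dictionary maps to NaN, so inspecting the unique values first usually shows what is wrong.

```python
import pandas as pd

# Hypothetical column with trailing/leading spaces and a lowercase value.
df = pd.DataFrame({"Gender": ["M ", "F", " f", "M"]})

# Inspect the raw values: repr-style output makes stray spaces visible.
print(df["Gender"].unique().tolist())

# Values that do not match a dictionary key exactly become NaN.
naive = df["Gender"].map({"F": 1, "M": 0})

# Normalise the strings first, then map.
cleaned = df["Gender"].str.strip().str.upper()
df["Gender"] = cleaned.map({"F": 1, "M": 0})
print(df)
```

Another possible cause is running the mapping line twice in a notebook: after the first run the column already holds 0 and 1, so a second run finds no 'F' or 'M' keys and returns NaN everywhere.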

Binarize categorical column within pipeline

I would like to include a one-vs-all binary classifier into my pipeline, but I'm struggling to do so.
I have a column of country names which I would like to convert into a binary format. For example: I would like Portugal to be replaced by 1 and all the other countries to 0.
I tried defining a function (column here would be country and true_value PRT):
def binary_tansformer_cat(column, true_value):
    df[column] = (df[column] == true_value).astype(int)
However, this does not really work and I get the following error:
TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. 'None' (type <class 'NoneType'>) doesn't.
I've also tried using LabelBinarizer, but that doesn't work either.
I would be grateful for any help! Let me know if you have any questions.
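The error message suggests that the pipeline step received None, which is what the function above returns because it modifies df in place. One possible approach, sketched here under assumptions, is to write a function that returns the transformed data and wrap it in scikit-learn's FunctionTransformer. The function name `binarize_category`, the sample frame, and the `nights` column are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def binarize_category(frame, true_value):
    # Return a new frame: 1 where the value equals true_value, else 0.
    return (frame == true_value).astype(int)

# Hypothetical data standing in for the real dataset.
df = pd.DataFrame({"country": ["PRT", "ESP", "PRT", "FRA"],
                   "nights": [2, 3, 1, 5]})

preprocess = ColumnTransformer(
    transformers=[
        ("country_is_prt",
         FunctionTransformer(binarize_category, kw_args={"true_value": "PRT"}),
         ["country"]),
    ],
    remainder="passthrough",  # leave the other columns unchanged
)

encoded = preprocess.fit_transform(df)
print(encoded)
```

Because `preprocess` implements fit and transform, it can be used as a step in a Pipeline ahead of a classifier.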

TypeError while using label encoder

I am using the Beers dataset in which I want to encode the data with datatype 'object'.
Following is my code.
from sklearn import preprocessing
df3 = BeerDF.select_dtypes(include=['object']).copy()
label_encoder = preprocessing.LabelEncoder()
df3 = df3.apply(label_encoder.fit_transform)
The following error is occurring.
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
Any insights are helpful!!!
Use:
df3 = df3.astype(str).apply(label_encoder.fit_transform)
From the TypeError, it seems that the column you want to encode contains two different data types, in your case string and float, and this is what raises the error. To avoid it, give the column a uniform dtype (string or float). For example, in the Iris classification dataset the classes are ['Setosa', 'Versicolour', 'Virginica'] and not ['Setosa', 3, 5, 'Versicolour'].
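A float mixed into a string column is very often a missing value (NaN). The sketch below assumes that is the case here. The sample frame and the "missing" placeholder are invented. Missing values are filled explicitly before the cast to str, and a fresh encoder is created for each column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the object columns of the Beers dataset.
df3 = pd.DataFrame({"style": ["IPA", np.nan, "Stout", "IPA"],
                    "brewery": ["A", "B", np.nan, "A"]})

# Give missing values an explicit label, then make every value a string.
uniform = df3.fillna("missing").astype(str)

# A new LabelEncoder per column keeps the columns independent.
encoded = uniform.apply(lambda col: LabelEncoder().fit_transform(col))
print(encoded)
```

Reusing a single encoder across columns through apply also works for producing the codes, but its classes_ attribute then only reflects the last column fitted, so the earlier columns can no longer be decoded.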

Error during Label Encoding Sci-kit Library

I am trying to encode my dataframe, which consists of strings, but I am receiving this error:
error :
'<' not supported between instances of 'str' and 'NoneType'",
'occurred at index ProductFabric'
CODE:
from sklearn import preprocessing
df1=df1.apply(preprocessing.LabelEncoder().fit_transform)
In your case, df1 may be a DataFrame with multiple columns, and some of them may contain null values, which cannot be compared with strings.
Here is an example from the sklearn documentation that I hope will help:
from sklearn import preprocessing
df = [1, 1, 2, 6]
le = preprocessing.LabelEncoder().fit_transform(df)
print(le)
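To find out which columns are affected, a small diagnostic like the one below may help. It is only a sketch: the helper name `value_types_per_column` and the sample data are hypothetical. It lists the Python types present in each column, so a column mixing str and NoneType stands out.

```python
import pandas as pd

def value_types_per_column(frame):
    # Map each column name to the sorted names of the Python types it contains.
    return {col: sorted({type(value).__name__ for value in frame[col]})
            for col in frame.columns}

# Hypothetical data; dtype=object keeps None as None instead of converting it.
df1 = pd.DataFrame({"ProductFabric": ["cotton", None, "silk"],
                    "Color": ["red", "blue", "red"]},
                   dtype=object)

types_found = value_types_per_column(df1)
print(types_found)
```

Columns that report more than one type need to be filled or cast to a single type before LabelEncoder can sort their values.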

Python 3.6.5 returns '<' not supported between instances of 'tuple' and 'str' error message

I'm trying to split a data set into a training part and a testing part. I am stuck on a structural problem: the hierarchy of the data seems to be wrong for the code below to work.
I tried the following:
import pandas as pd
import pandas_datareader.data as web  # assumed import; the original post omits it
data = pd.DataFrame(web.DataReader('SPY', data_source='morningstar')['Close'])
cutoff = '2015-1-1'
data = data[data.index < cutoff].dropna().copy()
As data.head() will reveal, data is not actually a pd.DataFrame but a pd.Series whose index is a pd.MultiIndex (as suggested also by the error which hints that each element is a tuple) rather than a pd.DatetimeIndex.
One fix is to move the first index level into the columns:
df = data.unstack(0)
With that, df[df.index < cutoff] performs the filtering you are trying to do.
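The effect of unstack(0) can be reproduced without downloading anything. The sketch below builds a small Series by hand. The level names Symbol and Date and the prices are made up, and the real data source may label things differently.

```python
import pandas as pd

# Hypothetical Series indexed by (Symbol, Date), mimicking the structure described above.
index = pd.MultiIndex.from_product(
    [["SPY"], pd.to_datetime(["2014-12-30", "2014-12-31", "2015-01-02"])],
    names=["Symbol", "Date"],
)
data = pd.Series([100.0, 101.0, 102.0], index=index, name="Close")

# Each index element is a tuple, so comparing the index with a string fails.
# Moving level 0 (Symbol) into the columns leaves a plain DatetimeIndex.
df = data.unstack(0)

cutoff = "2015-1-1"
before_cutoff = df[df.index < cutoff].dropna().copy()
print(before_cutoff)
```

After unstacking, the frame has one column per symbol, so the same filter also works when several tickers are loaded at once.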
