I'm facing a challenge: create a pipeline that imputes (SI) a categorical variable (e.g. colour) and then one-hot encodes (OHE) 2 variables (e.g. colour & dayofweek). colour is used in both steps.
I wanted to put SI and OHE in one ColumnTransformer. I just learnt that SI and OHE would run in parallel, meaning OHE would not encode the imputed colour (i.e. it would OHE the original, un-imputed colour).
I then tried:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

si = SimpleImputer(strategy='mean', add_indicator=True)
ohe = OneHotEncoder(sparse=False, drop='first')
ctsi = ColumnTransformer(transformers=[('si', si, ['colour'])], remainder='passthrough')
ctohe = ColumnTransformer(transformers=[('ohe', ohe, ['colour', 'dayofweek'])], remainder='passthrough')
pl = Pipeline([('ctsi', ctsi), ('ctohe', ctohe)])
outfit = pl.fit_transform(X, y)
I get the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I believe it's because SI returns a NumPy array, so the column name colour has been lost. When I change the OHE columns to a list of ints:
ctohe = ColumnTransformer(transformers=[('ohe', ohe, [0, 1])], remainder='passthrough')
It goes through. I'm just testing the processing; obviously, those column indices are incorrect.
So my challenge is: given what I want to accomplish, is it possible? And how can I do it?
Many thanks in advance!
Actually, I agree with your reasoning. The problem stems from the fact that ColumnTransformer forgets the column names after the transformation, and indeed - quoting the answer linked here - ColumnTransformer's intended usage is to deal with transformations applied in parallel. In my opinion that's also made explicit in the docs by this sentence:
This estimator allows different columns or column subsets of the input to be transformed separately. [...] This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
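To see this concretely, here's a minimal sketch (made-up data): two transformers in a single ColumnTransformer both receive the original input, so a passthrough copy of colour still contains the NaN that the imputer fills in.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

X = pd.DataFrame({'colour': ['red', np.nan, 'red', 'blue']})
ct = ColumnTransformer([
    ('si', SimpleImputer(strategy='most_frequent'), ['colour']),
    ('raw', 'passthrough', ['colour']),
])
print(ct.fit_transform(X))
# [['red' 'red']
#  ['red' nan]    <- the raw copy still has the NaN: both transformers
#  ['red' 'red']     saw the original column, i.e. they ran in parallel
#  ['blue' 'blue']]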
I guess one solution might be to go with a custom renaming of the columns, passing a callable to the columns portion of ColumnTransformer's transformers tuple (name, transformer, columns) (notation which follows the documentation) according to your needs (this would presumably be done on the second ColumnTransformer instance in your pipeline).
EDIT: I have to partially withdraw what I wrote; I'm not actually sure that passing a callable to columns would work for your need, because your problem does not really lie in column selection per se, but rather in column selection via string column names, for which you need a DataFrame (and, in my opinion, acting on the column selector alone won't solve that).
Instead, you probably need a transformer that restores the column names after the imputation and before the one-hot encoding, acting on a DataFrame (still granted that the setting is not ideal when different ColumnTransformer instances have to transform the same variables in sequence within a Pipeline).
Actually, a couple of months ago https://github.com/scikit-learn/scikit-learn/pull/21078 was merged; I suspect it is not yet in the latest release, because after upgrading sklearn I couldn't get it to work. Anyway, in my opinion it may ease similar situations in the future, as it adds get_feature_names_out() to SimpleImputer, and get_feature_names_out() is in turn really useful when dealing with column names.
In general, I would also suggest the post linked above for further details.
Finally, here's a naive example I could come up with; it's not scalable (I tried to get to something more scalable by exploiting the feature_names_in_ attribute of the fitted SimpleImputer instance, without arriving at a consistent result), but hopefully it gives some hints.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath'],
                  'expert_rating': [5, 3, 4, 5],
                  'user_rating': [4, 5, 4, 3]})

ct_1 = ColumnTransformer([('si', SimpleImputer(strategy='most_frequent'), ['city'])],
                         remainder='passthrough')
ct_2 = ColumnTransformer([('ohe', OneHotEncoder(), ['city'])],
                         remainder='passthrough', verbose_feature_names_out=True)

class ColumnExtractor(BaseEstimator, TransformerMixin):
    # Rebuilds a DataFrame (with column names) from the array that ct_1
    # outputs, so that ct_2 can select columns by name again.
    def __init__(self, columns):
        self.columns = columns

    def transform(self, X, *_):
        return pd.DataFrame(X, columns=self.columns)

    def fit(self, *_):
        return self

pipe = Pipeline([
    ('ct_1', ct_1),
    ('ce', ColumnExtractor(['city', 'title', 'expert_rating', 'user_rating'])),
    ('ct_2', ct_2)
])

pipe.fit_transform(X)
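As a side note, once the PR above lands in a release, the hard-coded column list passed to ColumnExtractor could presumably be derived from the fitted first step instead of typed out; something along these lines (an untested, version-dependent sketch):
ct_1.fit(X)
# expected to yield names like ['si__city', 'remainder__title', ...],
# from which the verbose prefixes could be stripped
print(ct_1.get_feature_names_out())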
I think the simplest option (for now) is separate pipelines for the two columns:
si = SimpleImputer(strategy='mean', add_indicator=True)
ohe = OneHotEncoder(sparse=False, drop='first')

# Impute and then encode, as a single sub-pipeline for the column that needs both.
siohe = Pipeline([
    ('si', si),
    ('ohe', ohe),
])

ct = ColumnTransformer(
    transformers=[
        ('siohe', siohe, ['colour']),
        ('ohe', ohe, ['dayofweek']),
    ],
    remainder='passthrough'
)
(You specified the strategy as 'mean', which is probably not what you want for a categorical feature like colour, but I've left it in the code above.)
You might also consider just using siohe for both columns: if there are no missing values in dayofweek (in production data!) then there's no real difference.
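Keeping the definitions above, that variant would just be the following (a sketch; for a categorical colour you would want strategy='most_frequent' rather than 'mean' in the imputer):
ct = ColumnTransformer(
    transformers=[
        ('siohe', siohe, ['colour', 'dayofweek']),
    ],
    remainder='passthrough'
)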
I am using the Beers dataset, in which I want to encode the columns with datatype 'object'.
Following is my code:
from sklearn import preprocessing
df3 = BeerDF.select_dtypes(include=['object']).copy()
label_encoder = preprocessing.LabelEncoder()
df3 = df3.apply(label_encoder.fit_transform)
The following error occurs:
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
Any insights are helpful!
Use:
df3 = df3.astype(str).apply(label_encoder.fit_transform)
From the TypeError, it seems that the column you want to encode has two different data types (dtypes), in your case string and float, which raises the error. To avoid this, you have to bring the column to a uniform dtype (string or float). For example, in the Iris classification dataset the class is ['Setosa', 'Versicolour', 'Virginica'] and not ['Setosa', 3, 5, 'Versicolour'].
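A common way to end up with such a mixed column without noticing is missing values: in an object column, NaN is a float, so the column holds both str and float. A small sketch with made-up data:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['ale', 'lager', np.nan])   # object dtype: two str values, one float (NaN)
le = LabelEncoder()
# le.fit(s)             # would raise the TypeError above: ['float', 'str']
le.fit(s.astype(str))   # uniform dtype; note that NaN becomes the string 'nan'
print(le.classes_)      # ['ale' 'lager' 'nan']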
I get an error during logistic regression when I run code like this:
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)
The error is:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
What should I do?
This is a decision you have to take based on the data and on the feature that contains the NaNs; imputing it directly with 0 will affect your results.
You can start with the following things (see the sketch after this list):
remove those rows and train on the rest
if categorical: replace with the mode
if continuous: replace with the mean
if sequential (like a time series): replace them with the mean of the rows above and below
if there are runs of consecutive missing values, try interpolation
And so on... please explain more about the data.
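Here's a quick sketch of those options in pandas (the column names are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'colour': ['red', np.nan, 'red', 'blue'],
                   'price': [1.0, np.nan, 3.0, 4.0]})

dropped = df.dropna()                                        # option 1: drop rows with NaN
df['colour'] = df['colour'].fillna(df['colour'].mode()[0])   # categorical: mode
df['price'] = df['price'].interpolate()                      # sequential: between neighbours
# or, for a plain continuous feature:
# df['price'] = df['price'].fillna(df['price'].mean())       # continuous: mean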
It indicates that your data contain missing or invalid values like NaN, Null or N/A.
You can handle such data using the following pandas methods:
1. fillna(0)
2. dropna()
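For example, on a DataFrame df (a minimal sketch):
df = df.fillna(0)   # replace every NaN with 0
# or
df = df.dropna()    # drop every row that contains a NaN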
I am trying to fit a Random Forest with sklearn.
Every time I run my algorithm, I encounter the error:
ValueError: could not convert string to float: '#DIV/0!'
Searching on Stack Overflow I found that it may occur because I am dividing by zero somewhere. To avoid it, I multiplied every value in the dataframe by 100 and then substituted every 0 with a 1: given the scale of the new values, that 1 would be irrelevant, or at least that is what I thought. The code I used is:
df = df.mul(100)
df = df.replace(0, 1)
What happens is that if I now try to fit my RF I get a new error:
ValueError: could not convert string to float: '-30.68464932-30.68464932-30.68464932-30.68464932...' (the same value concatenated over and over)
I am 100% sure that I'm not using any string as a value in my dataset.
So my question now becomes: how do I fix this problem?
EDIT
By using df.info() I discovered that one column was of type object. I solved this issue with the following one-liner:
df = df.apply(lambda col: pd.to_numeric(col, errors='coerce'))  # values that can't be parsed become NaN
Now all the columns are of dtype float64.
The problem is that now I get a new error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Alright, with further research I found a second one-liner which solved my problem: now the fitting is successful.
df = df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]  # keep only rows whose values are all finite
I am trying to run LabelEncoder on all columns that are of type object. This is the code I wrote but it throws this error:
TypeError: '<' not supported between instances of 'int' and 'str'
Does anybody know how to fix this?
le = LabelEncoder()
for col in X_test.columns.values:
    if X_test[col].dtypes == 'object':
        # fit on train + test together so all labels are known to the encoder
        data = X_train[col].append(X_test[col])
        le.fit(data.values)
        X_train[col] = le.transform(X_train[col])
        X_test[col] = le.transform(X_test[col])
It looks like the appended data contains mixed types. Try converting everything to str in the fit call:
le.fit(data.values.astype(str))
And you have to cast to str for transform as well, since the classes in the fitted LabelEncoder will be str:
X_train[col]=le.transform(X_train[col].astype(str))
X_test[col]=le.transform(X_test[col].astype(str))
Trying to recreate a similar problem: a DataFrame whose column contains both int and str values:
import pandas as pd
df = pd.DataFrame({'col1': ["tokyo", 1, "paris"]})
print(df)
Result:
col1
0 tokyo
1 1
2 paris
Now, using LabelEncoder gives a similar error message, i.e. TypeError: unorderable types: int() < str():
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.col1.values)
Converting everything to str in fit (or before it) resolves the issue:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.col1.values.astype(str))
print(le.classes_)
Result:
['1' 'paris' 'tokyo']
If you just call le.transform(df.col1), it will throw a similar error again.
So, it has to be le.transform(df.col1.astype(str)) instead.
The error is telling you the exact problem: some of the values are strings and some are not. You can solve this by calling c.astype(str) on the Series c each time you call fit, fit_transform, or transform, e.g.:
le.fit(data.values.astype(str))