I have a CSV file, and I'm preparing its data for training with different machine learning algorithms. I replaced missing numeric data with the mean of that column, but how should I deal with missing categorical data? Should I replace it with the most frequent element? And what is the easiest way to do this in Python using pandas?
Code:
dataset = pd.read_csv('doc.csv')
X = dataset.iloc[:, [2, 4, 5, 6, 7, 9,10 ,11]].values
y = dataset.iloc[:, -1].values
Column number 2 contains the categorical data.
first row value :
[3, 'S', 22.0, 1, 0, 7.25, 107722, 2]
Regarding the modelling part of your question, you're better off asking that at CrossValidated.
If there are too many records with missing data, you could just remove that column from consideration altogether. There are some other excellent suggestions on this StackOverflow post, including scikit-learn's Imputer class (replaced by SimpleImputer in current versions), or just letting the model handle the missing data.
Regarding replacing values in a column, look into the DataFrame.replace() method:
DataFrame.replace(
to_replace=None,
value=None,
inplace=False,
limit=None,
regex=False,
method='pad',
axis=None)
An example usage of this for your dataset, assuming that the missing column values are called 'N' and you are replacing them by some other category 'S' (which you found out using the DataFrame.mode() method): dataset[1].replace('N', 'S').
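As a minimal sketch of the most-frequent approach, assuming a toy frame with hypothetical column names Embarked and Age (not the asker's actual file), the mode can be computed with Series.mode() and filled in with fillna():

```python
import numpy as np
import pandas as pd

# Toy data; the column names are assumptions made for illustration only.
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S", np.nan],
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
})

# Categorical column: fill missing entries with the most frequent value.
most_frequent = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_frequent)

# Numeric column: fill missing entries with the column mean, as in the question.
df["Age"] = df["Age"].fillna(df["Age"].mean())
```

mode() returns a Series because there can be ties, so [0] simply picks the first of the most frequent values.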
While working on my submission for the famous Kaggle Titanic dataset (890 rows/11 columns) I would like to execute all of my 'Feature Engineering' steps within one scikit pipeline. However, I could barely find any online examples that demonstrate how to use the scikit FunctionTransformer() in order to execute slightly more complex custom functions, especially functions that refer to more than one column of the dataset.
In my concrete example, I would like to replace NaN values in the column 'Age' depending on the passenger class (column 'Pclass'). Possible passengers classes are 1, 2 or 3 and the corresponding ages that should replace the NaN values are 38, 30 and 25. My current code looks like this:
def impute_age_class(df, column_1, column_2):
    for i in range(len(df)):
        if np.isnan(df[column_1].iloc[i]):
            if df[column_2].iloc[i] == 1:
                df[column_1].iloc[i] = 38
            elif df[column_2].iloc[i] == 2:
                df[column_1].iloc[i] = 30
            else:
                df[column_1].iloc[i] = 25
    return df
age_transformers = [("impute_age_class", FunctionTransformer(impute_age_class,validate=False, kw_args={'column_1': 'Age', 'column_2': 'Pclass'}), ["Age", "Pclass"])]
It seems like the code gets executed and I receive a slightly better accuracy score with my logreg model but also the warnings on this picture:
[image: Note message]
I would be very thankful if you could give me any hints on whether the syntax of my code could be improved in order to avoid these warnings and ensure correct execution.
That warning is very common, and worth reading up on. But it's also not great to be looping over the rows of a dataframe. You can use pandas's own fillna for this:
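from sklearn.preprocessing import FunctionTransformer

def impute_age_class(df, fillme, groupby):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Fill NaNs in `fillme` with the value mapped from the `groupby` column
    df.loc[:, fillme] = df[fillme].fillna(
        value=df[groupby].map(
            {1: 38, 2: 30, 3: 25})
    )
    return df

tfmr = FunctionTransformer(
    impute_age_class,
    validate=False,
    kw_args={'fillme': 'Age', 'groupby': 'Pclass'}
)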
It's a little unusual to have the parameters for the two column names when you are hard-coding the mapping inside the function. And if you didn't have the mapping already in mind, it'd be better to learn it at fit time and then transform train and test sets with that mapping: see SimpleImputer with groupby and https://datascience.stackexchange.com/q/71856/55122.
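Learning the mapping at fit time could look roughly like the sketch below. GroupMedianImputer is a hypothetical helper written for this illustration (it is not part of scikit-learn), and using the per-group median is an assumption; any other statistic could be stored the same way.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fill NaNs in `fillme` with the median of its `groupby` group."""

    def __init__(self, fillme, groupby):
        self.fillme = fillme
        self.groupby = groupby

    def fit(self, X, y=None):
        # Learn one median per group, plus a global fallback for unseen groups.
        self.group_medians_ = X.groupby(self.groupby)[self.fillme].median()
        self.overall_median_ = X[self.fillme].median()
        return self

    def transform(self, X):
        X = X.copy()
        fill = X[self.groupby].map(self.group_medians_).fillna(self.overall_median_)
        X[self.fillme] = X[self.fillme].fillna(fill)
        return X


# Toy data, chosen so the learned medians are 38, 30 and 25.
train = pd.DataFrame({"Pclass": [1, 1, 2, 2, 3, 3],
                      "Age": [40.0, 36.0, 30.0, np.nan, 20.0, 30.0]})
test = pd.DataFrame({"Pclass": [1, 2, 3], "Age": [np.nan, np.nan, 50.0]})

imputer = GroupMedianImputer(fillme="Age", groupby="Pclass").fit(train)
test_filled = imputer.transform(test)
```

Because the medians are stored in fit, the test set is filled with values learned from the training set only.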
I am facing a challenge creating a pipeline to impute (SI) a categorical variable (e.g. colour) and then one-hot encode (OHE) 2 variables (e.g. colour & dayofweek). colour is used in both steps.
I wanted to put SI and OHE in one ColumnTransformer. I just learnt that SI and OHE then run in parallel, meaning OHE will not encode the imputed colour (i.e. it encodes the original, un-imputed colour).
I then tried:
si = SimpleImputer(strategy='mean', add_indicator=True)
ohe = OneHotEncoder(sparse=False,drop='first')
ctsi = ColumnTransformer(transformers=[('si',si,['colour'])],remainder='passthrough')
ctohe = ColumnTransformer(transformers=[('ohe',ohe,['colour','dayofweek'])],remainder='passthrough')
pl = Pipeline([('ctsi',ctsi),('ctohe',ctohe)])
outfit = pl.fit_transform(X,y)
I get the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I believe it's because the column name colour has been removed by SI. When I change the OHE columns to a list of int:
ctohe = ColumnTransformer(transformers=[('ohe',ohe,[0,1])],remainder='passthrough')
It goes through. I'm just testing the processing, obviously, the columns are incorrect.
So, given what I want to accomplish, is it possible? And how can I do it?
Many thanks in advance!
Actually, I agree with your reasoning. The problem is given by the fact that ColumnTransformer forgets the column names after the transformation and indeed - quoting the answer in here - ColumnTransformer's intended usage is to deal with transformations applied in parallel. That's also specified in the doc by means of this sentence in my opinion:
This estimator allows different columns or column subsets of the input to be transformed separately. [...] This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
I guess that one solution to this might be to go with a custom renaming of the columns, passing a callable to the columns portion of ColumnTransformer's transformers tuple (name, transformer, columns) (notation which follows the documentation) according to your needs (actually, this would work I guess if you pass a callable to your second ColumnTransformer instance in the pipeline).
EDIT: I have to withdraw somehow what I wrote, I'm not actually sure that passing a callable to columns might work for your need as your problem does not really stand in column selection per se, but rather in column selection via string column names, for which you would need a DataFrame (and imo acting on column selector only won't solve such problem).
Instead, you would rather need a transformer, acting on a DataFrame, that somehow restores the column names after the imputation and before the one-hot encoding (though the setting is still not ideal when different instances of ColumnTransformer have to transform the same variables in sequence in a Pipeline).
Actually, a couple of months ago the following PR was merged: https://github.com/scikit-learn/scikit-learn/pull/21078; I suspect it is not yet in the latest release, because I couldn't get it to work after upgrading sklearn. Anyway, IMO, it may help in similar situations in the future, as it adds get_feature_names_out() to SimpleImputer, and get_feature_names_out() is in turn really useful when dealing with column names.
In general, I would also suggest the same post linked above for further details.
Finally, here's a naive example I could come up with; it's not scalable (I tried to get to something more scalable by exploiting the feature_names_in_ attribute of the fitted SimpleImputer instance, without arriving at a consistent result), but hopefully it gives some hints.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath'],
                  'expert_rating': [5, 3, 4, 5],
                  'user_rating': [4, 5, 4, 3]})

# Step 1: impute 'city'; the output is a NumPy array, so the column names are lost
ct_1 = ColumnTransformer([('si', SimpleImputer(strategy='most_frequent'), ['city'])],
                         remainder='passthrough')
# Step 3: one-hot encode the imputed 'city', selected by name
ct_2 = ColumnTransformer([('ohe', OneHotEncoder(), ['city'])], remainder='passthrough', verbose_feature_names_out=True)

class ColumnExtractor(BaseEstimator, TransformerMixin):
    """Step 2: rebuild a DataFrame so that ct_2 can select columns by name."""
    def __init__(self, columns):
        self.columns = columns
    def transform(self, X, *_):
        return pd.DataFrame(X, columns=self.columns)
    def fit(self, *_):
        return self

pipe = Pipeline([
    ('ct_1', ct_1),
    ('ce', ColumnExtractor(['city', 'title', 'expert_rating', 'user_rating'])),
    ('ct_2', ct_2)
])
pipe.fit_transform(X)
I think the simplest option (for now) is separate pipelines for the two columns:
si = SimpleImputer(strategy='mean', add_indicator=True)
# scikit-learn >= 1.2 spells this argument sparse_output=False
ohe = OneHotEncoder(sparse=False, drop='first')
# Impute first, then encode, so the encoder sees the imputed colour
siohe = Pipeline([
    ('si', si),
    ('ohe', ohe),
])
ct = ColumnTransformer(
    transformers=[
        ('siohe', siohe, ['colour']),
        ('ohe', ohe, ['dayofweek']),
    ],
    remainder='passthrough'
)
(You specified the strategy as 'mean', which is probably not what you want, but I've left it in the code above.)
You might also consider just using siohe for both columns: if there are no missings in dayofweek (in production data!) then there's no real difference.
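A runnable sketch of this layout on toy data follows. The data and the price column are made up for illustration, and strategy='most_frequent' is swapped in for 'mean' because 'mean' does not work on string categories; the add_indicator flag is left out to keep the output small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data; 'price' is a hypothetical extra column that is passed through.
X = pd.DataFrame({
    "colour": ["red", "blue", np.nan, "red"],
    "dayofweek": ["Mon", "Tue", "Mon", "Wed"],
    "price": [1.0, 2.0, 3.0, 4.0],
})

# colour: impute, then encode (sequential, inside one Pipeline).
siohe = Pipeline([
    ("si", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(drop="first")),
])

ct = ColumnTransformer(
    transformers=[
        ("siohe", siohe, ["colour"]),
        ("ohe", OneHotEncoder(drop="first"), ["dayofweek"]),
    ],
    remainder="passthrough",
)

out = ct.fit_transform(X)
# The result may be sparse or dense depending on its density; normalise to dense.
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)
```

The columns of out are, in order: colour_red, dayofweek_Tue, dayofweek_Wed and price.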
I have a CSV file containing data: (just the first ten rows of data are listed)
0,11,31,65,67
1,31,33,67
2,33,43,67
3,31,33,67
4,24,31,33,65,67,68,71,75,76,93,97
5,31,33,67
6,65,93
7,2,33,34,51,66,67,84
8,44,55,66
9,2,33,51,54,67,84
10,33,51,66,67,84
The first column indicates the row number (e.g. the first column in the first row is 0). When I try to use
import pandas as pd
df0 = pd.read_csv('df0.txt', header=None, sep=',')
Error occurs as below:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 5, saw 12
I guess pandas infers the number of columns from the first row (5 columns). How can I declare the number of columns myself? It is known that there are 120 class labels in total, so I guess 121 columns should be enough.
Further, how can I transform it into one-hot encoding format? I want to use a neural network model to process the data.
For your first problem, you can pass a names=... parameter to read_csv:
df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',')
As for your second problem, there's an existing solution here that uses sklearn.preprocessing.OneHotEncoder. If you are looking to convert each column to a one-hot encoding, you may use it.
I gave this my best shot, but I don't think it's too good. I do think it gets at what you're asking. Based on my own ML knowledge and your question, I took you to be asking the following:
1.) You have a csv of numbers
2.) This is for a problem with 120 classes
3.) You want a matrix with 1s and 0s for each class
4.) For example, a csv such as:
1, 3
2, 3, 6
would become the feature matrix (the first line gives the column labels):
1, 2, 3, 6
1, 0, 1, 0
0, 1, 1, 1
Thus this code achieves that, but it is surely not optimized:
import pandas as pd

def func(df1, df2):
    # We can't join if columns overlap. Use set operations to identify
    non_overlapping_columns = list(set(df2.columns) - set(df1.columns))
    overlapping_columns = list(set(df2.columns) - set(non_overlapping_columns))
    # Join where possible
    df2_join = df2[non_overlapping_columns]
    df3 = df1.join(df2_join)
    # Manually add columns for overlaps
    for k in overlapping_columns:
        df3[k] = df3[k] + df2[k]
    return df3

file = 'df0.txt'  # path taken from the question
df = pd.read_csv(file, header=None, names=range(121), sep=',')
# One indicator frame per CSV column
one_hot = [pd.get_dummies(df[k]) for k in df.columns]
# Merge the per-column indicators into one frame (func must be defined before this point)
df = one_hot[0]
for dummies in one_hot[1:]:
    df = func(df1=df, df2=dummies)
From here you could feed it into sklearn's OneHotEncoder, as @cᴏʟᴅsᴘᴇᴇᴅ noted.
That would look like this:
import sys
from sklearn.preprocessing import OneHotEncoder
# The data goes to fit_transform, not to the constructor
onehot = OneHotEncoder().fit_transform(df)
sys.getsizeof(onehot) #smaller than Pandas
sys.getsizeof(df)
I guess I'm unsure if the assumptions I noted above are what you want done in your data, it seems perhaps they aren't.
I thought that for a given line in your csv, that was indicating the classes that exist. I guess I'm a little unclear on it still.
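Under that same reading (each line lists the classes present in that row), a more compact sketch uses scikit-learn's MultiLabelBinarizer. It assumes the class labels are the integers 0 to 119, which the question does not state explicitly, and it reads the file contents from an in-memory string here so that the example is self-contained.

```python
import csv
import io

from sklearn.preprocessing import MultiLabelBinarizer

# First three lines of the data from the question.
raw = "0,11,31,65,67\n1,31,33,67\n2,33,43,67\n"

# Drop the leading row number and keep the remaining values as class labels.
rows = [[int(value) for value in line[1:]] for line in csv.reader(io.StringIO(raw))]

# Assumption: labels are 0..119. Fixing `classes` keeps the width at 120
# even when some labels never occur in the data.
mlb = MultiLabelBinarizer(classes=list(range(120)))
multi_hot = mlb.fit_transform(rows)
```

Reading the lines with the csv module sidesteps the ragged-row problem entirely, since no rectangular table is needed before the encoding step.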
How does pandas categorical https://pandas.pydata.org/pandas-docs/stable/categorical.html handle new and unseen levels? I am thinking about a scikit-learn like setup. Currently, I have something like:
https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce
def fit():
    for each column:
        fit a label encoder

def transform():
    for each column:
        check if column was unseen
            yes (unseen): replace
            no: label encode
but this is pretty slow.
Apparently, tree-based libraries like XGBoost or LightGBM can handle categorical data directly, i.e. one would not need to fiddle around manually with this slow conversion.
But when looking at their code
https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L532 they seem to use LGBMLabelEncoder, which is a standard scikit-learn LabelEncoder.
I wonder how that can handle unseen data.
If a manual conversion is required would pandas.Categorical allow a quicker conversion - even if unseen levels are in the new data?
edit
Please see https://github.com/geoHeil/pythonQuestions/blob/master/categorical-encoding.ipynb for an overview how I could not get scikit-learn's usual suspects to work.
Still looking for something more performant than my solution. Also lightGBM https://github.com/Microsoft/LightGBM/issues/789 suggests to use custom encoding strategy.
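On the pandas.Categorical part of the question, a small sketch on toy data: when the categories learned from the training data are passed explicitly, values outside them become missing and get the integer code -1, so unseen levels do not raise an error.

```python
import pandas as pd

train = pd.Series(["a", "b", "c", "a"])
new = pd.Series(["a", "b", "d"])  # 'd' was never seen in training

# "Fit": remember the categories found in the training data.
categories = pd.Categorical(train).categories

# "Transform": encode with the stored categories; unseen levels get code -1.
train_codes = pd.Categorical(train, categories=categories).codes
new_codes = pd.Categorical(new, categories=categories).codes
```

Whether -1 is an acceptable encoding for unseen levels depends on the downstream model, so this is only a starting point.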
There might be a pandas solution, but it probably works best with sklearn's LabelBinarizer:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
df = pd.DataFrame({'A':['a','b','c','a']})
lb = LabelBinarizer()
lb.fit(df["A"])
lb.transform(df["A"])
[[1 0 0]
[0 1 0]
[0 0 1]
[1 0 0]]
df2 = pd.DataFrame({'A':['a','b','d']})
lb.transform(df2['A'])
[[1 0 0]
[0 1 0]
[0 0 0]]
So we see that 'd' is mapped to none of 'a', 'b' or 'c'.
Note however, that there is a bug which probably will be resolved in one of the next sklearn releases.
The LabelBinarizer is fit during training and remembers the values passed to it. New values get mapped to all zeros. It might be more practical to write a transformer (as seen here before the edit) using pandas get_dummies.
This could be quite straightforward thanks to name matching of columns. Fit in the first step and store the column names, then just transform in the transform step, keeping only the column names that you identified in fitting (and adding zero columns for training levels that are not present in the test set). Then you are done ;)
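A minimal sketch of such a transformer, assuming the input is a DataFrame of categorical columns. DummyEncoder is a hypothetical name chosen for this illustration, not an existing scikit-learn class.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DummyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: get_dummies with the column set frozen at fit time."""

    def fit(self, X, y=None):
        # Store the dummy column names seen during training.
        self.columns_ = pd.get_dummies(X, dtype=int).columns
        return self

    def transform(self, X):
        # Drop columns for unseen levels, add zero columns for absent levels.
        return pd.get_dummies(X, dtype=int).reindex(columns=self.columns_, fill_value=0)


train = pd.DataFrame({"A": ["a", "b", "c", "a"]})
test = pd.DataFrame({"A": ["a", "b", "d"]})

encoder = DummyEncoder().fit(train)
encoded = encoder.transform(test)
```

As with LabelBinarizer above, the unseen level 'd' ends up as a row of zeros, and the missing level 'c' still gets its (all-zero) column.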
I have a dataset with 41 features [columns 0 to 40], of which 7 are categorical. The categorical features fall into two subsets:
A subset of string type (column-features 1, 2, 3)
A subset of int type, in binary form 0 or 1 (column-features 6, 11, 20, 21)
Furthermore the column-features 1, 2 and 3 (of string type) have cardinality 3, 66 and 11 respectively.
In this context I have to encode them to use support vector machine algorithm.
This is the code that I have:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import feature_extraction
df = pd.read_csv("train.csv")
datanumpy = df.to_numpy()  # as_matrix() was removed in pandas 1.0
X = datanumpy[:, 0:41] # select columns 1 through 41 (the features)
y = datanumpy[:, 41] # select column 42 (the labels)
I don't know whether it is better to use DictVectorizer() or OneHotEncoder() [for the reasons given above], and above all how to use them [in terms of code] with the X matrix that I have.
Or should I simply assign a number to each category in the string-type subset (since they have high cardinality and so my feature space would grow considerably)?
EDIT
With respect to the int-type subset, I guess the best choice is to keep the column-features as they are (not pass them to any encoder).
The problem persists for the string-type subset with high cardinality.
This is by far the easiest:
df = pd.get_dummies(df, drop_first=True)
If you get a memory overflow or it is too slow then reduce the cardinality:
top = df[col].isin(df[col].value_counts().index[:10])
df.loc[~top, col] = "other"
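To make the cardinality reduction concrete, here is a small sketch on toy data. The column name service and the cut-off of 2 are assumptions chosen for illustration (the snippet above keeps the top 10).

```python
import pandas as pd

# Toy data with one string column; 'service' is a hypothetical column name.
df = pd.DataFrame({"service": ["http", "http", "http", "ftp", "ftp", "smtp", "ssh"]})
col = "service"
top_n = 2

# Keep the top_n most frequent levels and lump everything else into "other".
top = df[col].isin(df[col].value_counts().index[:top_n])
df.loc[~top, col] = "other"

# Three levels remain (ftp, http, other); drop_first removes the first one.
encoded = pd.get_dummies(df, drop_first=True)
```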
As per the official documentation of One Hot Encoder, it should be applied over the combined dataset (Train and Test). Otherwise it may not form a proper encoding.
And performance-wise I think One Hot Encoder is much better than DictVectorizer.
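An alternative to fitting on the combined data, sketched here as an assumption rather than as something the answer above proposes, is to fit the encoder on the training set only and pass handle_unknown='ignore', so that categories which appear only in the test set are encoded as all zeros. The column name protocol and the data are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"protocol": ["tcp", "udp", "icmp", "tcp"]})
test = pd.DataFrame({"protocol": ["udp", "sctp"]})  # 'sctp' is unseen

# Fit on the training data only; unseen categories are ignored at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Learned categories are sorted: icmp, tcp, udp.
encoded_test = encoder.transform(test).toarray()
```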
You can use the pandas function get_dummies() as suggested by @simon above, or you can use the sklearn equivalent given by OneHotEncoder.
I prefer OneHotEncoder because you can pass it parameters such as the categorical features you want to encode and the number of values to keep for each feature (if not indicated, it is inferred from the data). In older scikit-learn versions these were categorical_features and n_values; both were removed in 0.22, and current versions offer categories and max_categories instead.
If, for some features, the cardinality is too big, impose a low n_values.
If you have enough memory don't worry, encode all the values of your features.
I guess for a cardinality of 66, if you have a basic computer, encoding all of the 66 features won't lead to a memory issue. Memory overflow usually happens when you have for example as much values for a feature as the number of samples in your dataset (the case for IDs that are unique per sample). The bigger the dataset, the more likely you'll get a memory issue.