Convert Categorical features to Numerical - python

I have a lot of categorical columns and want to convert the values in those columns to numerical values so that I can apply an ML model.
My data currently looks something like this:
Column 1 - Good/Bad/Poor/Not reported
Column 2 - Red/Amber/Green
Column 3 - 1/2/3
Column 4 - Yes/No
I have already assigned the numerical values 1, 2, 3, 4 to Good, Bad, Poor, Not reported in column 1.
Can I now give similar numerical values, like 1, 2, 3, to Red, Green, Amber in column 2, and do the same for the other columns, or will doing that confuse the model when I implement it?

You can do this for the rated (ordinal) columns by using df[colname].map({}) or LabelEncoder().
Both will change each category to a number, which gives the values an implicit weight: if Poor is 1 and Good is 3, there is a difference between them, and for a rated column that is exactly what you want the model to know. But for something like colours there is no inherent preference (green is no different from blue), so it is better not to use the same method there and to use get_dummies in pandas instead.
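A small sketch of both options, using placeholder column names based on the question:
import pandas as pd

df = pd.DataFrame({'rating': ['Good', 'Bad', 'Poor', 'Not reported'],
                   'colour': ['Red', 'Amber', 'Green', 'Red']})

# Ordinal column: map to numbers so the ranking is preserved
df['rating_num'] = df['rating'].map({'Poor': 1, 'Bad': 2, 'Good': 3, 'Not reported': 0})

# Nominal column: one-hot encode instead of assigning an arbitrary order
df = pd.get_dummies(df, columns=['colour'])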

The colour values you mention are nominal: there is no ranking or order to these values. If you assign 1, 2, 3, etc., the data can be misrepresented as coming from a scale.
To avoid this you can transform them using one-hot encoding. This effectively encodes a multi-value categorical field into the following:
red = 100
amber = 010
green = 001
You can use OneHotEncoder from sklearn.preprocessing:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
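A minimal sketch of how that class is used, with the colour values from the question:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

colours = np.array([['red'], ['amber'], ['green'], ['red']])   # one column of nominal values
encoder = OneHotEncoder(sparse_output=False)   # older scikit-learn versions use sparse=False instead
encoded = encoder.fit_transform(colours)
print(encoder.categories_)   # order of the one-hot columns
print(encoded)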

Related

Classify variables as nominal/ordinal/interval/binary when column types are not provided by the user?

If there are no predefined column types (nominal/interval) stored, and some variables are encoded as 1,2,3... in place of the actual categories (e.g. Good, Better, Bad...), they may automatically be classified as interval variables even though they are actually nominal variables that have been encoded.
Is there any way to identify such variables?
I thought of using cardinality, but choosing a threshold becomes an issue. Please suggest some other solution.
I'm fine with a Python solution, but if someone can give an idea in SAS that would be helpful too :)
As a data analyst, it is your call whether to treat a categorical column as nominal or ordinal (depending on the data).
If the data is nominal --> use dummy variables (one-hot encoding).
If the data is ordinal --> use the map() function for label encoding.
If the data is nominal and its cardinality is high --> encode according to frequency counts. Say a column has 30 different categories across 1000 rows; 3 categories have a high frequency count, so keep those as separate categories, while the other 27 have very low counts, so put all of them into one single category. That leaves only 4 categories instead of 30.
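A rough sketch of that frequency-count idea, with a placeholder column name and threshold:
import pandas as pd

def group_rare_categories(s, top_n=3, other_label='other'):
    # Keep the top_n most frequent categories and collapse everything else into one bucket
    keep = s.value_counts().nlargest(top_n).index
    return s.where(s.isin(keep), other_label)

df['city_grouped'] = group_rare_categories(df['city'], top_n=3)   # 'city' is a hypothetical high-cardinality column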
Apart from object-type (string) columns, frequency counts play a very important role in identifying categorical variables among numeric columns.

Transpose categorical column into multiple columns in python

I really hope you guys can help. I feel like this is a bit complex, so I will try to explain it as well as I can.
I have a large dataset containing final batch numbers and their incoming batch numbers (see picture). Along with this data I have a column called parameter_name, which is the name of the respective value in value/value(1/2/3)/tem_sec_value. value is a categorical value, while the others are numerical. If there is a value in value 1/2/3 then it is the same value, and I can just use the first one.
My goal is to transpose the parameter_name column so that the different parameters become columns with their respective values as rows, while still keeping batch_final, batch_incomming, order_no, etc. batch_incomming is not unique, and neither is parameter_name, since another batch_final can have the same parameter_name but with a different value. How do I do this in Python or, if easier, in SQL?
The expected output should be as seen in the picture below.
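One way to sketch this in pandas, assuming the column names mentioned in the question (batch_final, batch_incomming, order_no, parameter_name, value):
import pandas as pd

# Pivot so each distinct parameter_name becomes its own column,
# keeping the batch identifiers as ordinary columns afterwards
wide = (df.pivot_table(index=['batch_final', 'batch_incomming', 'order_no'],
                       columns='parameter_name',
                       values='value',
                       aggfunc='first')
          .reset_index())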

implementing hot deck imputation in python

I have a dataset that contains both numeric and categorical data, like this:
subject_id hour_measure heart rate blood_pressure urine color
3 4 60
4 2 70 60 red
6 1 30 yellow
I tried various methods to handle the missing data, such as the following code:
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df[cols] = df[cols].fillna(df[cols].transform(f))
df= df.fillna(method='ffill')
but these techniques didn't give me the result I want. I tried to use hot deck imputation; I already understand the concept of the technique, as it is a suitable way to handle both numeric and categorical data.
If you are using your data as input for machine learning, you can convert the columns containing text to numbers (e.g. with a look-up table, or by converting the colours to their corresponding RGB values).
Regarding the second part of your question: could you be more specific about what results you are expecting and what your current code produces?
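A minimal sketch of the look-up-table idea for the urine colour column from the question (the mapping here is just a placeholder):
colour_lut = {'red': 0, 'yellow': 1}   # extend with whatever colours appear in the data
df['urine color'] = df['urine color'].map(colour_lut)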
The hot-deck method is defined in the literature as a method that replaces missing values with randomly selected observed values from the dataset at hand. So I implemented hot-deck imputation to handle the missing data like this:
import random
import pandas as pd

def hotdeck_imputation(data):
    for c in data.columns:
        observed = data[c].dropna().tolist()
        # pd.isna handles both numeric NaN and missing categorical values
        data.loc[:, c] = [random.choice(observed) if pd.isna(i) else i for i in data[c]]
    return data
I hope it helps with your problem.
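A quick usage sketch, with df being the frame from the question:
df = hotdeck_imputation(df)
print(df.isna().sum())   # all missing-value counts should now be zero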

How to divide a continuous-variable column of a dataset into categories for a decision tree?

When I have a numeric column with values from 123 to 189, how can I know how many groups I should divide these numbers into, and which thresholds are the best option?
Can anyone explain this ?
Thanks in advance!
Nurlan, well, you have given very little information.
So the values of your column range from 123 to 189?
It all depends on the domain and the problem's context.
Say, if I had a column called Marks, ranging from 0-100,
I would try to categorize it into 3 classes - Low, Medium, High.
Maybe also try 4 classes - Low, Average, Medium, High.
And then compare performance in each case.
Another good approach would be to divide them by quartiles.
Pandas has cut and qcut for this purpose.
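A small sketch of both, using made-up values in the 123-189 range from the question:
import pandas as pd

values = pd.Series([123, 131, 147, 160, 172, 189])
equal_width = pd.cut(values, bins=3, labels=['Low', 'Medium', 'High'])   # equal-width bins
by_quartile = pd.qcut(values, q=4)                                       # quartile bins, roughly equal counts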

What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.
Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?
My initial thoughts are:
1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data
2) If some percentage of the values in the column is unique (e.g., >=20%), then the column very likely contains continuous data
I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?
Edit: Someone requested that I explain why 2) didn't work well. There were some test cases where we still had continuous values in a column but there weren't many unique values in the column. The heuristic in 2) obviously failed in that case. There were also issues where we had a categorical column with many, many unique values, e.g., passenger names in the Titanic data set. Same column type misclassification problem there.
Here are a couple of approaches:
Find the ratio of the number of unique values to the total number of values. Something like the following:
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05  # or some other threshold
Check if the top n unique values account for more than a certain proportion of all values
top_n = 10
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
Approach 1) has generally worked better for me than Approach 2). But Approach 2) is better if there is a 'long-tailed distribution', where a small number of categories have high frequency while a large number of categories have low frequency.
There are many places from which you could "steal" the definitions of formats that can be cast as numbers; ##,#e-# would be one such format, just to illustrate. Maybe you'll be able to find a library to do it.
I try to cast everything to numbers first, and whatever is left over, well, there is no other way but to keep it categorical.
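One way to sketch this cast-everything-first idea in pandas (the function name is just illustrative):
import pandas as pd

def split_numeric_and_categorical(df):
    # A column counts as numeric if coercing it to numbers introduces no new missing values
    coerced = df.apply(lambda col: pd.to_numeric(col, errors='coerce'))
    numeric_cols = [c for c in df.columns if coerced[c].notna().sum() == df[c].notna().sum()]
    categorical_cols = [c for c in df.columns if c not in numeric_cols]
    return numeric_cols, categorical_cols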
You could define which data types count as numeric and then exclude the corresponding variables.
If the initial dataframe is df:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
dataframe = df.select_dtypes(exclude=numerics)
I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.
If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.
If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.
But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?
I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.
I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:
% floats: percentage of values that are float
% int: percentage of values that are whole numbers
% string: percentage of values that are strings
% unique string: number of unique string values / total number
% unique integers: number of unique integer values / total number
mean numerical value (non numerical values considered 0 for this)
std deviation of numerical values
and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
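A sketch of computing a few of those per-column features (the names and exact definitions are just illustrative):
import pandas as pd

def column_features(s):
    # Rough type statistics for one pandas.Series
    numeric = pd.to_numeric(s, errors='coerce')
    is_string = s.apply(lambda v: isinstance(v, str))
    return {
        'pct_float': (numeric.notna() & (numeric % 1 != 0)).mean(),
        'pct_int': (numeric.notna() & (numeric % 1 == 0)).mean(),
        'pct_string': is_string.mean(),
        'pct_unique': s.nunique() / len(s),
        'mean_numeric': numeric.fillna(0).mean(),
        'std_numeric': numeric.fillna(0).std(),
    }

features = {col: column_features(df[col]) for col in df.columns}   # df is the frame being profiled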
Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to treat a variable as ordinal if it turns out to be quantitative, right? The preprocessing steps would encode the ordinal values numerically anyway, without one-hot encoding.
A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.
IMO the opposite strategy, identifying categoricals, is better, because it depends on what the data is about. Technically, address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.
For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which would probably need hardcoded (and translated) levels to look for: "good", "bad", ".agree.", "very .*", ...) or int values in the 0-8 range plus NA.
Countries and such things might also be identifiable...
Age groups (".-.") might also work.
I've been looking at this and thought it might be useful to share what I have. This builds on @Rishabh Srivastava's answer.
import pandas as pd

def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
    """Removes categorical features from X using a given method.

    X: pd.DataFrame, dataframe to remove categorical features from.
    """
    if method == 'fraction_unique':
        unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
        reduced_X = X.loc[:, unique_fraction > min_fraction_unique]
    if method == 'named_columns':
        non_cat_cols = [col not in cat_cols for col in X.columns]
        reduced_X = X.loc[:, non_cat_cols]
    return reduced_X
You can then call this function, passing a pandas DataFrame as X, and either remove named categorical columns or remove columns with a low fraction of unique values (specified by min_fraction_unique).
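For example (the column names here are just placeholders):
reduced = remove_cat_features(df, method='named_columns', cat_cols=['colour', 'city'])
reduced = remove_cat_features(df, method='fraction_unique', min_fraction_unique=0.05)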
