I want to use Random Forest for feature selection based on the Gini index. My dataset has a mix of numeric (continuous) and categorical (string) data. This is an example of the dataset:
Var1 Var2
198 zcROj17IEC
336 DHeTmBftjz
252.3 crIgUHSK8h
252 ZSNrjIX0Db
I know trees work on discrete (categorical) data, but does RandomForest in sklearn require continuous numeric data to be discretized first, or can it handle it directly? For the categorical string variables, I used the following to encode the strings into numeric columns of zeros and ones:
pandas.get_dummies(X['Var2'])
and it works, but for the numeric column I tried the following to discretize:
pandas.qcut(X['Var1'], 2 , retbins=True)
but I keep getting an error about non-unique bin edges!
Do I need to discretize? How can I do it?
Random forest should support continuous variables no problem. See for example this sample.
Trees and forests often work worse when you make dummies from your categorical values.
You just need to label-encode your categorical features, that's all!
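To illustrate the answers above, here is a minimal sketch, assuming a toy dataset shaped like the one in the question (the values, the target `y`, and the short category strings are made up for illustration). It passes the continuous column to the forest without any discretization, label-encodes the string column with `OrdinalEncoder`, and reads the impurity-based (Gini) importances from `feature_importances_`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): one continuous column, one string column, binary target.
X = pd.DataFrame({
    "Var1": [198.0, 336.0, 252.3, 252.0, 120.5, 410.0, 305.2, 150.0],
    "Var2": ["a", "b", "a", "c", "c", "b", "a", "c"],
})
y = [0, 1, 0, 0, 0, 1, 1, 0]

# Label-encode the string column; Var1 stays continuous, no binning needed.
X_enc = X.copy()
X_enc["Var2"] = OrdinalEncoder().fit_transform(X[["Var2"]]).ravel()

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_enc, y)

# Impurity-based (Gini) importances: one value per column, normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=X_enc.columns)
print(importances.sort_values(ascending=False))
```

Columns can then be ranked or thresholded on `importances`. Note that label encoding imposes an arbitrary order on the categories, which trees can usually work around with extra splits.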
I have the House Prices - Advanced Regression Techniques Data set. I need to do Lasso and Ridge Regularization on it. I saved the train data in the variable named house. Typed the following code:
house.info()
Got this output (image not shown).
There are columns in this data set which are numerical (int64 and float64) but are actually categorical (both ordinal and nominal).
I wanted to ask whether I can standardize these categorical variables, or whether I should first convert all of them to type "object" using house[col_name]=house[col_name].astype(str), then do one-hot encoding on them and standardize the remaining numerical columns.
When a column is categorical, you can apply one-hot encoding, so that each category of the column becomes its own binary column.
import pandas as pd
# columns= selects which columns to encode; prefix= names the resulting dummy columns
raw_df = pd.get_dummies(data=raw_df,
                        columns=['col1', 'col2', 'col3'],
                        prefix=['feature1', 'feature2', 'feature3'])
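For the Ridge/Lasso question specifically, one possible arrangement is sketched below. It assumes a tiny made-up stand-in for the `house` frame (the three columns and their values are illustrative only). `OneHotEncoder` treats integer codes as categories as-is, so the `astype(str)` step is not strictly needed, and `ColumnTransformer` applies standardization to the numeric columns only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the house data: MSSubClass is an integer code
# that is really a category; LotArea is a genuinely numeric feature.
house = pd.DataFrame({
    "MSSubClass": [20, 60, 20, 70, 60, 20],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})
categorical_cols = ["MSSubClass"]
numeric_cols = ["LotArea"]
feature_cols = categorical_cols + numeric_cols

# One-hot encode the categorical codes, standardize only the numeric columns.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("scale", StandardScaler(), numeric_cols),
])

# Swap Ridge for Lasso to get L1 regularization instead of L2.
model = make_pipeline(preprocess, Ridge(alpha=1.0))
model.fit(house[feature_cols], house["SalePrice"])
predictions = model.predict(house[feature_cols])
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training data only, which matters once cross-validation is used to pick `alpha`.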
I'm having trouble with my regression tree using the sklearn package. It's about a book dataset, in which the regression tree can be seen below:
The problem is in the STORY_LANGUAGE variable. This is a categorical variable with the values 0, 1, 2, and 3, which all correspond to a different language of the book. Before running the model, I've made sure that STORY_LANGUAGE is a categorical variable, yet the tree still splits it and treats it as a float (1.5).
How should I solve this? Any help is appreciated!
By passing integers as a feature to scikit-learn, you're telling it that the values are ordered, e.g. that 0 is more closely related to 1 than to 2. To get around this, you will need to do one-hot encoding with the built-in OneHotEncoder. If you have three categories, 0, 1 and 2, a 0 will be converted to [1,0,0], while a 1 will be converted to [0,1,0]. Basically, your one feature is replaced with a vector that is equal to 1 at the position corresponding to its class and 0 elsewhere.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# Generate random integers between 0 and 2
x = np.random.randint(0,3, size=(100,1))
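# Create the one-hot encoder object, asking for a dense array instead of a sparse matrix.
# (scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False.)
m = OneHotEncoder(sparse_output=False)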
# Transform your features
x_one_hot = m.fit_transform(x)
If you are using sklearn's DecisionTreeRegressor, your label encoded features will simply be treated as numerical features. If you want them to be treated as categories, you can either perform one-hot-encoding (e.g. using OneHotEncoder) or use an algorithm that supports categorical features out of the box (e.g. lightGBM).
I am looking to predict whether someone is a smoker from several columns of demographic data stored in a csv, as well as their smoker status.
The columns used are:
Gender, Age, Race, ServedInMilitary, CountryofBirth, EducationLevel, MaritalStatus, HouseholdIncome, FamilyIncome, ChildrenInHouse, QuantitiyofAlcohol, PerUnitTime, ShortnessOfBreath, Asthma, Exercise, Smoker, SmokedBefore, AgeStartedSmoking.
All columns have numeric, but not necessarily binary, values. Could someone help me correct my code to take these factors into account when determining smoker status and then help me measure the accuracy of my classifier?
I have the following code from a similar question: how to Load CSV Data in scikit and using it for Naive Bayes Classification
target_names = np.array(['Positives','Negatives'])
# add columns to your data frame
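data['is_train'] = np.random.uniform(0, 1, len(data)) <= 0.75
# pd.Factor was removed from pandas; pd.Categorical.from_codes is its replacement
data['Type'] = pd.Categorical.from_codes(targets, target_names)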
data['Targets'] = targets
# define training and test sets
train = data[data['is_train']==True]
test = data[data['is_train']==False]
trainTargets = np.array(train['Targets']).astype(int)
testTargets = np.array(test['Targets']).astype(int)
# columns you want to model
features = data.columns[0:7]
# call Gaussian Naive Bayesian class with default parameters
gnb = GaussianNB()
# train model
y_gnb = gnb.fit(train[features], trainTargets).predict(train[features])
#Predict Output
There seems to be a missing line here for the dataframe, but I will assume you have it. If you don't, then read your data using pandas.read_csv.
Also, your columns seem to have data that is both categorical and numerical. For example, the "SmokedBefore" column is likely 1/0 whereas your "Age" column is likely numbers such as 20 or 30.
This makes a difference, because in "SmokedBefore" the intent is not to say that 1>0. The intent is to say Yes/No. If your model assumes that higher (or lower) is better, then this will cause an issue. Therefore it is categorical and should not be treated like a numerical value. It is simply a tag to indicate whether someone has smoked before.
However, in "Age" the intent is to say that 30 is different than 20 by 10. Therefore, it is numerical and should be treated as such.
To handle this, you need to transform each categorical feature into a set of binary features, one per category. This is called one-hot encoding. Instead of a single 1/0 "SmokedBefore" column, you get "is_1" and "is_0" columns, each holding a 1 where the row has that value and a 0 otherwise.
You can use the OneHotEncoder class provided in sklearn. Older versions accepted a categorical_features argument to specify which columns are categorical; that argument has since been removed, and in current versions you select the columns by wrapping the encoder in a ColumnTransformer.
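Putting the advice above together, a minimal sketch might look like the following. It assumes synthetic data with only three of the question's columns plus the `Smoker` target (all values are randomly generated, not real), keeps `GaussianNB` because the question uses it, and measures accuracy on a held-out test set with `accuracy_score`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the CSV (hypothetical values).
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n),
    "Race": rng.integers(1, 6, size=n),
    "SmokedBefore": rng.integers(0, 2, size=n),
})
data["Smoker"] = ((data["SmokedBefore"] == 1) & (rng.random(n) < 0.7)).astype(int)

categorical = ["Race", "SmokedBefore"]  # codes without a meaningful order
numeric = ["Age"]                       # genuinely numeric

# One-hot encode the categorical columns, pass the numeric ones through.
# sparse_threshold=0 forces a dense result, which GaussianNB requires.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
    sparse_threshold=0,
)
model = make_pipeline(encode, GaussianNB())

X = data[categorical + numeric]
y = data["Smoker"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # predict on held-out rows, not the training set
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.3f}")
```

`train_test_split` replaces the manual `is_train` column from the question, and evaluating on `X_test` rather than the training rows gives a less optimistic accuracy estimate.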
I would like to change my feature 'Age' from a continuous variable, to a categorical variable of age ranges for binary classification, like this:
df['Age'] = pd.cut(df['Age'], [0,6,12,16,65,90] ,labels=['0-6','6-12','12-16','16-65','65-90'])
However, I want to split it in the optimal way, so that the data can be classified most efficiently, i.e. the variance of classes within each age range is minimised without overfitting.
Is there a package which has a method, that can minimise variance when splitting data like this, or do I have to write one myself?
Maybe you can use sklearn.cluster to do this.
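One way to read the sklearn.cluster suggestion is sketched below, assuming synthetic ages and an arbitrary choice of five clusters. It runs 1-D `KMeans` on `Age`, takes the midpoints between sorted cluster centres as bin edges, and feeds those edges to `pd.cut`. This minimises within-bin variance of the ages themselves; it does not look at the class labels, so it only partly addresses the "optimal for classification" goal.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic ages (hypothetical data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Age": rng.integers(0, 90, size=300)})

# Cluster the single Age column; n_clusters=5 is an arbitrary choice here.
km = KMeans(n_clusters=5, n_init=10, random_state=0)
km.fit(df[["Age"]])

# Bin edges: midpoints between neighbouring cluster centres,
# padded so that the minimum and maximum ages fall inside the outer bins.
centers = np.sort(km.cluster_centers_.ravel())
inner_edges = (centers[:-1] + centers[1:]) / 2
edges = np.concatenate(([df["Age"].min() - 1], inner_edges, [df["Age"].max()]))

df["AgeBin"] = pd.cut(df["Age"], bins=edges)
print(df["AgeBin"].value_counts().sort_index())
```

sklearn.preprocessing.KBinsDiscretizer with strategy="kmeans" packages the same idea as a transformer.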
I am using sklearn.preprocessing.OneHotEncoder to encode categorical data of the form
A=array([[1,4,1],[0,3,2]])
B=array([[1,4,7],[0,3,2]])
Suppose I use A at the .fit(A) step and B at some later point as new data for .transform(B). If B contains values unseen with respect to A, doing so produces a "feature out of bounds" error. Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?
ValueError: Feature out of bounds. Try setting n_values.
I understand I can change the feature bounds at .fit time. But if I am using A as training data, each time I get a new set B to predict on, I would have to mess with my initial encoding.
Thanks.
Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the concerned value?
No, but it would be nice if OneHotEncoder did that, so I've opened an issue for this. For now, you'll just have to set n_values a bit higher.
This feature has since been added to OneHotEncoder. You can enable it by setting the parameter handle_unknown='ignore'.
For example:
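from numpy import array
from sklearn.preprocessing import OneHotEncoder
A = array([[1,4,1],[0,3,2]])
B = array([[1,4,7],[0,3,2]])
onehot = OneHotEncoder(handle_unknown='ignore')
A = onehot.fit_transform(A)
# The unseen value 7 in B's third column is encoded as all zeros for that column
B = onehot.transform(B)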