How to handle categorical variables in sklearn GradientBoostingClassifier? - python

I am attempting to train models with GradientBoostingClassifier using categorical variables.
The following is a primitive code sample, just for trying to input categorical variables into GradientBoostingClassifier.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
import pandas
iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]
# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]
X_train = pandas.DataFrame(X_train)
# Insert fake categorical variable.
# Just for testing in GradientBoostingClassifier.
X_train[0] = ['a']*40 + ['b']*40
# Model.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)
The following error appears:
ValueError: could not convert string to float: 'b'
From what I gather, it seems that One Hot Encoding on categorical variables is required before GradientBoostingClassifier can build the model.
Can GradientBoostingClassifier build models using categorical variables without having to do one hot encoding?
R gbm package is capable of handling the sample data above. I'm looking for a Python library with equivalent capability.

pandas.get_dummies or statsmodels.tools.tools.categorical can be used to convert categorical variables to a dummy matrix. We can then merge the dummy matrix back to the training data.
Below is the example code from the question with the above procedure carried out.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve,auc
from statsmodels.tools import categorical
import numpy as np
iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]
# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]
###########################################################################
###### Convert categorical variable to matrix and merge back with training
###### data.
# Fake categorical variable.
catVar = np.array(['a']*40 + ['b']*40)
catVar = categorical(catVar, drop=True)
X_train = np.concatenate((X_train, catVar), axis = 1)
catVar = np.array(['a']*10 + ['b']*10)
catVar = categorical(catVar, drop=True)
X_test = np.concatenate((X_test, catVar), axis = 1)
###########################################################################
# Model and test.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)
prob = clf.predict_proba(X_test)[:,1] # Only look at P(y==1).
fpr, tpr, thresholds = roc_curve(y_test, prob)
roc_auc_prob = auc(fpr, tpr)
print(prob)
print(y_test)
print(roc_auc_prob)
Thanks to Andreas Muller for instructing that pandas Dataframe should not be used for scikit-learn estimators.

Sure it can handle it, you just have to encode the categorical variables as a separate step on the pipeline. Sklearn is perfectly capable of handling categorical variables as well as R or any other ML package. The R package is still (presumably) doing one-hot encoding behind the scenes, it just doesn't separate the concerns of encoding and fitting in this case (as it arguably should).

Related

Scikit-Learn Numpy - Use One Hot Encoder on only string or categorical columns in dataset

I have a simple linear regression model below that uses one hot encoding to transform every X value. My question is how can I modify the code below to use one hot encoding for every column except one (e.g. the integer one highlighted below)
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=1)
# one-hot encode input variables that are objects
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
I tried only feeding in 8 columns instead of 9 to OHE but got the error:
ValueError: The number of features in X is different to the number of features of the fitted data. The fitted data had 9 features and the X has 8 features.

Complement Naive Bayes and weighted class in sklearn

I'm trying to implement a complement naive bayes classifier using sklearn. My data have very imbalanced classes (30k samples of class 0 and 6k samples of the 1 class) and I'm trying to compensate this using weighted class.
Here is the shape of my dataset:
enter image description here
I tried to use the compute compute_class_weight function to calcute the weights and then pass it to the fit function when training my model:
import numpy as np
import seaborn as sn
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.naive_bayes import ComplementNB
#Import the csv data
data = pd.read_csv('output_pt900.csv')
#Create the header of the csv file
header = []
for x in range(0,2500):
header.append('pixel' + str(x))
header.append('status')
#Add the header to the csv data
data.columns = header
#Replace the b's and the f's in the status column by 0 and 1
data['status'] = data['status'].replace('b',0)
data['status'] = data['status'].replace('f',1)
print(data)
#Drop the NaN values
data = data.dropna()
#Separate the features variables and the status
y = data['status']
x = data.drop('status',axis=1)
#Split the original dataset into two other: train and test
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)
all_together = y_train.to_numpy()
unique_classes = np.unique(all_together)
c_w = class_weight.compute_class_weight('balanced', unique_classes, all_together)
clf = ComplementNB()
clf.fit(x_train,y_train, c_w)
y_predict = clf.predict(x_test)
cm = confusion_matrix(y_test, y_predict)
svm = sn.heatmap(cm, cmap='Blues', annot=True, fmt='g')
figure=svm.get_figure()
figure.savefig('confusion_matrix_cnb.png', dpi=400)
plt.show()
but I got thesse error:
ValueError: sample_weight.shape == (2,), expected (29752,)!
Anyone knows how to use weighted class in sklearn models?
compute_class_weight returns an array of length equal to the number of unique classes with the weight to assign to instances of each class (link). So if there are 2 unique classes, c_w has length 2, containing the weight that should be assigned to samples with label 0 and 1.
When calling fit for your model, the weight for each sample is expected by the sample_weight argument. This should explain the error you received. To solve this issue, you need to use c_w returned by compute_class_weight to create an array of individual sample weights. You could do this with [c_w[i] for i in all_together]. Your fit call would ultimately look something like:
clf.fit(x_train, y_train, sample_weight=[c_w[i] for i in all_together])

Having all my predictions inclined to one side for binary classification

I was training a model that contains 8 features that allow us to predict the probability of a room been sold.
Region: The region the room belongs to (an integer, taking a value between 1 and 10)
Date: The date of stay (an integer between 1‐365, here we consider only one‐day request)
Weekday: Day of week (an integer between 1‐7)
Apartment: Whether the room is a whole apartment (1) or just a room (0)
#beds:The number of beds in the room (an integer between 1‐4)
Review: Average review of the seller (a continuous variable between 1 and 5)
Pic Quality: Quality of the picture of the room (a continuous variable between 0 and 1)
Price: he historic posted price of the room (a continuous variable)
Accept: Whether this post gets accepted (someone took it, 1) or not (0) in the end*
Column Accept is the "y". Hence, this is a binary classification.
I have done OneHotEncoder for the categorical data.
I have applied normalization to the data.
I have modified the following RandomRofrest parameters:
Max_depth: Peak at 16
n_estimators: Peak at 300
min_samples_leaf:
Peak at 2
max_features: Has no effect on the AUC.
The AUC peaked at 0.7889. What else can I do to increase it?
Here is my code
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
df_train = pd.read_csv('case2_training.csv')
# Exclude ID since it is not a feature
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05,shuffle=False)
ohe = OneHotEncoder(sparse = False)
column_trans = make_column_transformer(
(OneHotEncoder(),['Region','Weekday','Apartment']),remainder='passthrough')
X_train = column_trans.fit_transform(X_train)
X_test = column_trans.fit_transform(X_test)
# Normalization
from sklearn.preprocessing import MaxAbsScaler
mabsc = MaxAbsScaler()
X_train = mabsc.fit_transform(X_train)
X_test = mabsc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
RF = RandomForestClassifier(min_samples_leaf=2,random_state=0, n_estimators=300,max_depth = 16,n_jobs=-1,oob_score=True,max_features=i)
cross_val_score(RF,X_train,y_train,cv=5,scoring = 'roc_auc').mean()
RF.fit(X_train, y_train)
yhat = RF.predict_proba(X_test)
print("AUC:",roc_auc_score(y_test, yhat[:,-1]))
# Run the prediction on the given test set.
testset = pd.read_csv('case2_testing.csv')
testset = testset.iloc[:, 1:] # exclude the 'ID' column
testset = column_trans.fit_transform(testset)
testset = mabsc.transform(testset)
yhat_2 = RF.predict_proba(testset)
final_prediction = yhat[:,-1]
However, all the probabilities from 'final_prediction` are below 0.45, basically, the model believes that all the samples are 0.
Can anyone help ?
You are using column_trans.fit_transform on the test set, this completely overwrites the features that were fitted during training. Basically the data is now in a format your trained model doesn't understand.
Once fitted in training on the training set, simply use column_trans.transform afterwards.

Sklearn DecisionTreeClassifier F-Score Different Results with Each run

I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data, and f1_score for my evaluation metric. The strange thing is that I'm noticing my model giving me different results in a pattern at each run.
data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column being the target value. Columns 1, 3, and 5 are categorical data.
The following code is what I did to preprocess and format my data:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, 6]
# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)
for i in range(len(categorical_data)):
X[categorical_data[i]] = labelenc.fit_transform(X[categorical_data[i]])
# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.fit_transform(X_val)
The next code is for the actual decision tree model training:
dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')
print("Score is = {}".format(score))
The output that I get (i.e. the score) varies, but in a pattern. For example, it would circulate among data within the range of 0.39 and 0.42.
On some iterations, I even get the UndefinedMetricWarning, that claims "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
I'm familiar with what the UndefinedMetricWarning means, after doing some searching on this community and Google. I guess the two questions I have may be organized to:
Why does my output vary for each iteration? Is there something in the preprocessing stage that happens which I'm not aware of?
I've also tried to use the F-score with other data splits, but I always get the warning. Is this unpreventable?
Thank you.
You are splitting the dataset into train and test which randomly divides sets for both train and test. Due to this, when you train your model with different training data everytime, and testing it with different test data, you will get a range of F score depending on how well the model is trained.
In order to replicate the result each time you run, use random_state parameter. It will maintain a random number state which will give you the same random number each time you run. This shows that the random numbers are generated in the same order. This can be any number.
#train test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
#Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)

multi-class logistic regression using sklearn (representing y as multi-class)

I'm working on a what I thought was a fairly simple machine learning problem.
In this problem the y (label) I'm wanting to classify is a multi-class value. In this dataset I have 6 possible choices.
I've been using the preprocessing.LabelBinarizer() function to pivot my y set to an array of ones or zeros in hopes that this would be sufficient (e.g. [0 0 0 0 0 1]).
This code below fails on the model.fit() due to a ValueError: Found arrays with inconsistent numbers of samples: [ 217 1302] || 1302 is 217*6 BTW
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y = pd.DataFrame(api_y)
y = np.ravel(y)
It seems that the binarizer returns results that appear like 6 columns to the algorithm instead of 1 column containing an array of 6 digits.
I've tried to force it into an array model using the code below but then the fit function bails for another reason: ValueError: Unknown label type array([array[0,1,0,0,0]), arrary([0,1,0,0...])
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y_list = []
for x in api_y:
item = {'gear': np.array(x)}
y_list.append(item)
y = pd.DataFrame(y_list)
print("after changing to binary classes array y is "+repr(y.shape))
y = np.ravel(y)
I also tried the sklearn_pandas.DataFrameMapper to no avail as it also created 6 distinct fields vs. an array of 6 values represented as one field.
Any help or suggestions would be appreciated...full version of what I thought was right posted here for clarity:
#!/Library/Frameworks/Python.framework/Versions/3.5/bin/python3
import pandas as pd
import numpy as np
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn import metrics
import sklearn_pandas
#
# load traing data taken from 2 years of strava rides
df = pd.DataFrame.from_csv("gear_train.csv")
#
# Prepare data for logistic regression
#
y, X = dmatrices('gear ~ distance + moving_time + total_elevation_gain + average_speed + max_speed + average_cadence + has_heartrate + device_watts', df, return_type="dataframe")
#
# Fix up y to be a flattened array of 1 column (binary array?)
#
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y = pd.DataFrame(api_y)
y = np.ravel(y)
#
# run the logistic regression
#
model = LogisticRegression()
model = model.fit(X, y)
score = model.score(X, y)
#
# evaluate the model by splitting into training and testing data sets
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
predicted = model2.predict(X_test)
print("predicted="+repr(lb2.inverse_transform(predicted)))
print(metrics.classification_report(y_test, predicted))
#
# do a 10-fold CV test to see if this model holds up
#
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(scores.mean())enter code here
The root cause of my problem was y field contained string values instead of numeric. For example b12345 as a key instead of 12345. Once I changed that to use LabelEncoding and Decoding it worked like a champ.

Categories