XGBoost: How to map prediction probabilities back to the original multiclass label names? - python

I am using the xgboost multiclass classifier as outlined in the example below. For each row in the X_test dataframe the model outputs a list whose elements are the probabilities corresponding to each category 'a', 'b', 'c' or 'd', e.g. [0.44767836 0.2043365 0.15775423 0.19023092].
How can I tell which element in the list corresponds to which class/category (a, b, c or d)? My goal is to create 4 extra columns on the dataframe, a, b, c and d, with the matching probability as the row value in each column.
import numpy as np
import pandas as pd
import xgboost as xgb
import random
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
#Create Example Data
np.random.seed(312)
data = np.random.random((10000, 3))
y = [random.choice('abcd') for _ in range(data.shape[0])]
features = ["x1", "x2", "x3"]
df = pd.DataFrame(data=data, columns=features)
df['y']=y
#Encode target variable
labelencoder = preprocessing.LabelEncoder()
df['y_target'] = labelencoder.fit_transform(df['y'])
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(df[features], df['y_target'], test_size=0.2, random_state=42, stratify=y)
#Train Model
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
param = {
    'objective': 'multi:softprob',
    'random_state': 20,
    'tree_method': 'gpu_hist',
    'num_class': 4
}
xgb_model = xgb.train(param, dtrain, 100)
predictions = xgb_model.predict(dtest)
print(predictions)

Predictions follow the same order as your encoded labels 0, 1, 2, 3 (LabelEncoder sorts the classes, so here 0 = 'a', 1 = 'b', 2 = 'c', 3 = 'd'). To get the original target names, use the classes_ attribute of the LabelEncoder.
import pandas as pd
pd.DataFrame(predictions, columns=labelencoder.classes_)
>>>
a b c d
0 0.133130 0.214460 0.569207 0.083203
1 0.232991 0.275813 0.237639 0.253557
2 0.163103 0.248531 0.114013 0.474352
3 0.296990 0.202413 0.157542 0.343054
4 0.199861 0.460732 0.228247 0.111159
...
1995 0.021859 0.460219 0.235214 0.282708
1996 0.145394 0.182243 0.225992 0.446370
1997 0.128586 0.318980 0.237229 0.315205
1998 0.250899 0.257968 0.274477 0.216657
1999 0.252377 0.236990 0.221835 0.288798
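To put the four probability columns onto the test dataframe itself (the stated goal), a minimal sketch is to build the probability frame with X_test's index and join it back; the column names come straight from labelencoder.classes_:

proba_df = pd.DataFrame(predictions,
                        columns=labelencoder.classes_,
                        index=X_test.index)  # align rows with X_test
X_test_with_proba = X_test.join(proba_df)   # adds columns a, b, c, d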

Related

How can we include a prediction column in the initial dataset/dataframe after performing K-Fold cross validation?

I would like to run a K-fold cross validation on my data using a classifier. I want to include the prediction (or predicted probability) columns for each sample directly in the initial dataset/dataframe. Any ideas?
from sklearn.metrics import accuracy_score, roc_auc_score
import pandas as pd
from sklearn.model_selection import KFold

# X, y and model are assumed to be defined earlier
k = 5
kf = KFold(n_splits=k, random_state=None)
acc_score = []
auroc_score = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    pred_values = model.predict(X_test)
    predict_prob = model.predict_proba(X_test.values)[:, 1]
    auroc = roc_auc_score(y_test, predict_prob)
    acc = accuracy_score(pred_values, y_test)
    auroc_score.append(auroc)
    acc_score.append(acc)
avg_acc_score = sum(acc_score) / k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))
print('AUROC of each fold - {}'.format(auroc_score))
print('Avg AUROC : {}'.format(sum(auroc_score) / k))
Given this code, how could I add a prediction column, or even better the prediction probability columns, for each sample to the initial dataset?
In 10-fold cross-validation, each example (sample) is used exactly once in a test set and 9 times in a training set. So, after 10-fold cross-validation, the result should be a dataframe with the predicted class for ALL examples in the dataset. Each example keeps its initial features and its labelled class, plus the class predicted in the cross-validation fold where that example was in the test set.
You can use cross_val_predict (see the help page); it basically returns the cross-validated estimates:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.linear_model import LogisticRegression
import pandas as pd

X, y = make_classification()
df = pd.DataFrame(X, columns=["feature{:02d}".format(i) for i in range(X.shape[1])])
df['label'] = y
df['pred'] = cross_val_predict(LogisticRegression(), X, y, cv=KFold(5))
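Since the question also asks for predicted probabilities: cross_val_predict accepts a method argument, so (a sketch, assuming the binary target that make_classification produces by default) the per-class probability columns can be added the same way:

proba = cross_val_predict(LogisticRegression(), X, y,
                          cv=KFold(5), method='predict_proba')
df['proba_0'] = proba[:, 0]  # probability of class 0
df['proba_1'] = proba[:, 1]  # probability of class 1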
You can use the .loc method to accomplish this. This related question has a nice answer that shows how to use it: df.loc[index_position, "column_name"] = some_value
So, here is an edited version of the code you posted (I needed data, and removed auroc since we aren't using probabilities per your edit):
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = MLPClassifier()
k = 5
kf = KFold(n_splits=k, random_state=None)
acc_score = []
# Create the prediction column
X['Prediction'] = 1
# Define which columns to use as model inputs
model_columns = [x for x in X.columns if x != 'Prediction']
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train[model_columns], y_train)
    pred_values = model.predict(X_test[model_columns])
    acc = accuracy_score(pred_values, y_test)
    acc_score.append(acc)
    # Write this fold's predictions back into the dataframe
    X.loc[test_index, 'Prediction'] = pred_values
avg_acc_score = sum(acc_score) / k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))
# Add the label back per the question
X['Label'] = y
# Print the first 5 rows to show that it works
print(X.head(n=5))
Yields
accuracy of each fold - [0.9210526315789473, 0.9122807017543859, 0.9736842105263158, 0.9649122807017544, 0.8672566371681416]
Avg accuracy : 0.927837292345909
mean radius mean texture ... Prediction Label
0 17.99 10.38 ... 0 0
1 20.57 17.77 ... 0 0
2 19.69 21.25 ... 0 0
3 11.42 20.38 ... 1 0
4 20.29 14.34 ... 0 0
[5 rows x 32 columns]
(Obviously the model/values etc are all arbitrary)
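To store probability columns instead (the "even better" option in the question), a sketch under the same setup: initialize the new columns after model_columns is computed (so they are not used as features), then write predict_proba results back inside the loop in place of the single Prediction column:

# after model_columns is defined, before the loop:
X['Proba_0'] = 0.0
X['Proba_1'] = 0.0
# inside the loop, after model.fit(...):
proba = model.predict_proba(X_test[model_columns])
X.loc[test_index, 'Proba_0'] = proba[:, 0]  # probability of class 0
X.loc[test_index, 'Proba_1'] = proba[:, 1]  # probability of class 1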

Scikit-Learn Numpy - Use One Hot Encoder on only string or categorical columns in dataset

I have a simple logistic regression model below that uses one-hot encoding to transform every X value. My question is: how can I modify the code below to use one-hot encoding for every column except one (e.g. a single integer column)?
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)
# one-hot encode input variables that are objects
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
I tried feeding only 8 columns instead of 9 into the OneHotEncoder, but got the error:
ValueError: The number of features in X is different to the number of features of the fitted data. The fitted data had 9 features and the X has 8 features.
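One way to encode only some columns (a sketch, not from the original thread; it assumes, hypothetically, that the column to skip is at index 3 and already numeric) is sklearn's ColumnTransformer with remainder='passthrough', replacing the onehot_encoder fit/transform block above. Fitting and transforming on all 9 columns avoids the shape mismatch in the error:

from sklearn.compose import ColumnTransformer

# hypothetical: column 3 is the one to leave untouched; adjust to the real position
categorical_cols = [i for i in range(X_train.shape[1]) if i != 3]
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough')  # pass the remaining column through unchanged
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)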

Classification metrics cannot handle a mix of binary and continuous targets - Python random forests

My input data file has the form:
gold, callersAtLeast1T, CalleesAtLeast1T, ...
T,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
I am trying to predict the first column (gold) based on the values of the remaining columns, here is the code that I am using:
import pandas as pd
import numpy as np
dataset = pd.read_csv( 'data1extended.txt', sep= ',')
#convert T into 1 and N into 0
dataset['gold'] = dataset['gold'].astype('category').cat.codes
print(dataset.head())
row_count, column_count = dataset.shape
X = dataset.iloc[:, 1:column_count].values
y = dataset.iloc[:, 0].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
The last 3 lines of my code cause an error; how can I fix it?
This line causes the error:
print(confusion_matrix(y_test,y_pred))
I printed y_test and y_pred and here is what I obtained:
y_test is: [0 0 0 ... 0 0 0]
y_pred is: [0.0007123 0.00402548 0.00402548 ... 0.00402548 0.02651928 0.00816086]
You're using RandomForestRegressor, which outputs a continuous value (a real number), whereas the confusion matrix expects a categorical value (a discrete number: 0, 1, 2 and so on).
Since you're trying to predict classes (either 1 or 0), you can do two things:
1.) Use RandomForestClassifier instead of RandomForestRegressor, which will output 0 or 1, and use that for your metrics. (Recommended)
2.) If you want the real-valued output, you can apply a threshold, i.e.
y_pred = (y_pred > threshold).astype(int)
This transforms the real-valued output to 1 if the number is greater than the threshold, else 0, and you can then use it for your metrics.
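A minimal sketch of option 1, reusing the train/test split from the question:

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)  # discrete 0/1 labels, valid for confusion_matrix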

Simple Imputation model n_feature and input n_feature not matching

I am trying to learn SimpleImputer on the dataset provided in the course tab on Kaggle - https://www.kaggle.com/alexisbcook/missing-values
The CSV file is available at the above link.
While trying out the code I am getting the following error:
ValueError: Number of features of the model must match the input. Model n_features is 6 and input n_features is 9
Any help sorting out the issue will be appreciated.
My Code:
import pandas as pd
df0 = pd.read_csv('/Users/ratnam03chanakya/Desktop/Projects/Kaggle/02.melb_data/melb_data.csv')
df0.head()
y = df0.Price
features = ['Rooms', 'Distance', 'Bathroom', 'Car', 'Landsize', 'BuildingArea','YearBuilt', 'Lattitude', 'Longtitude']
X = df0[features]
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X,y,random_state=0)
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
def score_dataset(X_train, X_valid, y_train, y_valid):
    model0 = RandomForestRegressor()
    model0.fit(reduced_X_train, y_train)
    model0_predict = model0.predict(X_valid)
    mae = mean_absolute_error(y_valid, model0_predict)
    return mae
print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
# IMPUTATION
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train,imputed_X_valid,y_train,y_valid))
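The error message lines up with a likely bug in score_dataset: the function fits on the outer variable reduced_X_train (6 columns remain after dropping the ones with missing values, matching "Model n_features is 6") instead of its X_train argument, so calling it with the 9-column imputed data fails at predict time. A hedged fix:

def score_dataset(X_train, X_valid, y_train, y_valid):
    model0 = RandomForestRegressor()
    model0.fit(X_train, y_train)  # fit on the argument, not reduced_X_train
    model0_predict = model0.predict(X_valid)
    return mean_absolute_error(y_valid, model0_predict)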

Import CSV to OrderedDict and predict using regression

I built a regression model to predict energy (1 column) from 5 variables (5 columns). I used my experimental data to train and fit the model, and it works with a good score.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('new.csv')
X = data.drop(['E'], axis=1)
y = data['E']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=2)
from sklearn import ensemble
clf1 = ensemble.GradientBoostingRegressor(n_estimators=400, max_depth=5,
                                          min_samples_split=2, loss='ls',
                                          learning_rate=0.1)
clf1.fit(X_train, y_train)
clf1.score(X_test, y_test)
But now I want to load a new CSV file containing new data for the 5 variables mentioned into an OrderedDict and use the model to predict energy.
With the code below I manually insert one row at a time, and it predicts energy correctly:
from collections import OrderedDict

new_data = OrderedDict([('H', 48.52512), ('A', 169.8379), ('P', 55.52512),
                        ('R', 3.058758), ('Q', 2038.055)])
new_data = pd.Series(new_data)
data = new_data.values.reshape(1, -1)
clf1.predict(data)
But I can't do this row by row with huge datasets and need to import a CSV file. I tried the code below but can't figure it out:
data_2 = pd.read_csv('new2.csv')
X_new = OrderedDict(data_2)
new_data = pd.Series(X_new)
data = new_data.values.reshape(1, -1)
clf1.predict(data)
But it gives me: ValueError: setting an array element with a sequence.
Can anyone help?
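Wrapping the dataframe in OrderedDict turns each column into a whole Series, so new_data.values becomes an array of Series objects and reshape/predict fail with the ValueError. Since the model was trained on a dataframe, new rows can go to predict directly; a minimal sketch (assuming new2.csv contains the same five feature columns as the training data):

data_2 = pd.read_csv('new2.csv')
X_new = data_2[X.columns]          # the 5 feature columns, in training order
predictions = clf1.predict(X_new)  # one predicted energy value per row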
