ROC curve for multi-class classification without one vs all in python - python

I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without using the one vs all technique, as the number of classes is very high and it might be inefficient.
I have tried using the tips from the documentation in scikit learn [1], but there the one vs all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. For the one vs all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var= factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=2),
n_estimators=250,
learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
#Reverse factorize (converting y_pred from 0s,1s, 2s, etc. to their original values
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried directly with the roc curve function, and also binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
lb = preprocessing.LabelBinarizer()
lb.fit(y_test)
y_test = lb.transform(y_test)
y_pred = lb.transform(y_pred)
return roc_curve(y_test, y_pred)
multiclass_roc_auc(y_test, y_pred_test)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?

Related

how to build an artificial neural network with scikit-learn?

I am trying to run an artificial neural network with scikit-learn.
I want to run the regression, get the model fit results, an generate out of sample forecasts.
This is my code below. Any help will be greatly appreciated.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor
#import the data
df=pd.read_excel(r"C:\Users\Action\Downloads\Python\Practice_Data\sorted_data v2.xlsx")
#view the data
df.head(5)
#to drop a column of data type
df2=df.drop('Unnamed: 13', axis=1)
#view the data
df2.head(5)
Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score
describe the data
df.describe().transpose()
target_column = ['public health care services']
predictors = list(set(list(df.columns))-set(target_column))
df[predictors] = df[predictors]/df[predictors].max()
df.describe().transpose()
set the X and Y
X = df[predictors].values
y = df[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
import MLP Classifier and fit the network
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500)
mlp.fit(X_train,y_train)
predict_train = mlp.predict(X_train)
set up the MLP Classifier
mlp = MLPClassifier(
hidden_layer_sizes=(50, 8),
max_iter=15,
alpha=1e-4,
solver="sgd",
verbose=True,
random_state=1,
learning_rate_init=0.1)
import the warnings
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
predict_test = mlp.predict(X_test)
to train on the data I use the MLPClassifier to call the fit function on the training data.
mlp.fit(X_train, y_train)
after this, the neural network is done training.
after the neural network is trained, the next step is to test it.
print out the model scores
print(f"Training set score: {mlp.score(X_train, y_train)}")
print(f"Test set score: {mlp.score(X_test, y_test)}")
y_predict = mlp.predict(X_train)
I am getting an error from below
x_ann = y_predict[:, 0]
y_ann = y_predict[:, 1]
The error message is
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
any help will be greatly appreciated
predict function gives you the actual class and since your point can belong to one and only one class (except multi label), it is supposed to be like this only
What is the shape of your Y_true_labels? Might be the case that your labels are Sparse and with 2 classes, means 0,1 and since the models is minimising Log Loss as described here as:
This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
Also looking at the predict() it says:
log_y_probndarray of shape (n_samples, n_classes)
The predicted log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_. Equivalent to log(predict_proba(X))
So it means that if probability is 0.3 it means it belongs to class and if it's 0.7 it belongs to class, ASSUMING it's binary classification with a threshold set to 0.5.
What you might be confusing with is the predict_proba() function which gives you the probabilities for each classes.
Might be the case. Please post your X,Y data shape and type so that we can understand better.

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and ı want to perform multiple linear regression on it.
first I drop the deliveries column for it's high correlation with miles. Although gasprice is supposed to be removed, I don't remove it so that I can perform multiple linear regression and not simple linear regression.
finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
#lets find out what are our coeffs of the multiple linear regression and olso find intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
odel_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I found different coeffs every time I printed them out. what did I do wrong and is any of them correct?
I see you are trying 3 different things here, so let me summarize:
sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
sklearn.linear_model.LinearRegression() with the full dataset, as in n2.
I tried to reproduce with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on the same exact data), and only slightly different coefficients for case 1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html

How to get multi-class roc_auc in cross validate in sklearn?

I have a classification problem where I want to get the roc_auc value using cross_validate in sklearn. My code is as follows.
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state = 0, class_weight="balanced")
from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring = ('accuracy', 'roc_auc'))
However, I get the following error.
ValueError: multiclass format is not supported
Please note that I selected roc_auc specifically is that it supports both binary and multiclass classification as mentioned in: https://scikit-learn.org/stable/modules/model_evaluation.html
I have binary classification dataset too. Please let me know how to resolve this error.
I am happy to provide more details if needed.
By default multi_class='raise' so you need explicitly to change this.
From the docs:
multi_class {‘raise’, ‘ovr’, ‘ovo’}, default=’raise’
Multiclass only. Determines the type of configuration to use. The
default value raises an error, so either 'ovr' or 'ovo' must be passed
explicitly.
'ovr':
Computes the AUC of each class against the rest [3] [4]. This treats
the multiclass case in the same way as the multilabel case. Sensitive
to class imbalance even when average == 'macro', because class
imbalance affects the composition of each of the ‘rest’ groupings.
'ovo':
Computes the average AUC of all possible pairwise combinations of
classes [5]. Insensitive to class imbalance when average == 'macro'.
Solution:
Use make_scorer (docs):
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state = 0, class_weight="balanced")
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
myscore = make_scorer(roc_auc_score, multi_class='ovo',needs_proba=True)
from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring = myscore)

How to calculate r-squared with python?

I have fitted a model from which I'd like to know the scores (r-squared).
The data is split into a training and testing set. Although the model is only trained using the training set, how is it possible that my r-squared for my testing data is higher? I mean the model has never seen the testing set, but is more accurate than with the training set... Am I interpreting something wrong?
My code:
import pandas as pd
import numpy
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy
import sklearn
from sklearn.linear_model import LinearRegression
from scipy import stats
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
df=pd.read_csv("https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-
data/CognitiveClass/DA0101EN/module_5_auto.csv")
df=df._get_numeric_data()
y_data = df['price']
x_data=df.drop('price',axis=1)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data,
test_size=0.15, random_state=1)
lr=LinearRegression()
lr.fit(x_train[['horsepower']], y_train)
h=lr.score(x_train[['horsepower']], y_train).mean()
h2=lr.score(x_test[['horsepower']], y_test).mean()
print(h,h2)
To put it simply, R-Squared is used to find the 'difference in percent' or calculate the accuracy of two time-series datasets.
Formula
Note: squaring Pearsons-r, squaring pandas corr(), or r^2 have slightly different results than R^2 formula shown above, this is due to 'statistic round up' reasons... refer to Max Pierini's answer
SciKit Learn R-squared is very different from square of Pearson's Correlation R
Method 1: function
def r_squared(y, y_hat):
y_bar = y.mean()
ss_tot = ((y-y_bar)**2).sum()
ss_res = ((y-y_hat)**2).sum()
return 1 - (ss_res/ss_tot)
Method 2: sklearn library
from sklearn.metrics import r2_score
r2 = r2_score(actual, predicted)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
It looks like you're using scikit-learn. If so, you can use the r2_score metric.

How can I create a Linear Regression Model from a split dataset?

I've just split my data into a training and testing set and my plan is to train a Linear Regression model and be able to check what the performance is like using my testing split.
My current code is:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Dataset.csv')
df['split'] = np.random.randn(df.shape[0], 1)
split = np.random.rand(len(df)) <= 0.75
training_set = df[split]
testing_set = df[~split]
Is there a proper method I should be using to plot a Linear Regression model from an external file such as a .csv?
Since you want to use scikit-learn, here's an approach using sklearn.linear_model.LinearRegression:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X_train, y_train = training_set[x_vars], training_set[y_var]
X_test, y_test = testing_test[x_vars], testing_test[y_var]
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Depending on whether you need more descriptive output, you might also look into use statsmodels for linear regression.

Categories