I am a newbie in machine learning. Recently, I learnt how to compute the confusion_matrix for the test set of a KNN classification. But I do not know how to compute the confusion_matrix for the training set of a KNN classification.
How can I compute the confusion_matrix for the training set from the following code?
The following code computes the confusion_matrix for the test set:
# Split test and train data
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array(dataset.iloc[:, 1:10]) # .ix is deprecated; use .iloc
y = np.array(dataset['benign_malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#Define Classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, y_train)
# Predicting the Test set results
y_pred = knn.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) # Calculate confusion matrix for test set.
For k-fold cross-validation:
I am also trying to find the confusion_matrix for the training set using k-fold cross-validation.
I am confused by the line knn.fit(X_train, y_train). Do I need to change this line?
Where should I change the following code to compute the confusion_matrix for the training set?
# Applying k-fold Method
from sklearn.model_selection import StratifiedKFold # sklearn.cross_validation is deprecated
kfold = 10 # no. of folds (better to have this at the start of the code)
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)
# Stratified KFold: this first divides the data into k folds, and it also makes sure that the class distribution in each fold follows the original input distribution
skfind = list(skf.split(X, y)) # one (train indices, test indices) pair per fold
# skfind[i][0] -> train indices, skfind[i][1] -> test indices
# Supervised Classification with k-fold Cross Validation
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
conf_mat = np.zeros((2,2)) # Initializing the Confusion Matrix
n_neighbors = 5 # better to have this at the start of the code
# 10-fold Cross Validation
for i in range(kfold):
    train_indices = skfind[i][0]
    test_indices = skfind[i][1]
    clf = KNeighborsClassifier(n_neighbors=n_neighbors, metric='minkowski', p=2)
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    # fit on the training fold
    clf.fit(X_train, y_train)
    # predict on the held-out fold
    y_predict_test = clf.predict(X_test) # output is labels and not indices
    # Compute the confusion matrix for this fold
    cm = confusion_matrix(y_test, y_predict_test)
    print(cm)
    # conf_mat = conf_mat + cm
You don't have to make many changes:
# Predicting the train set results
y_train_pred = knn.predict(X_train)
cm_train = confusion_matrix(y_train, y_train_pred)
Here, instead of using X_test we use X_train for prediction, and then we produce a confusion matrix from the predicted classes for the training dataset and the actual classes.
The idea behind a confusion matrix is essentially to count the number of classifications falling into four categories (if y is binary):
predicted True but actually false
predicted True and actually True
predicted False but actually True
predicted False and actually False
So as long as you have two sets, predicted and actual, you can create the confusion matrix. All you have to do is predict the classes, and use the actual classes to get the confusion matrix.
EDIT
In the cross-validation part, you can add a line y_predict_train = clf.predict(X_train) to calculate the confusion matrix for each iteration. You can do this because, in the loop, you initialize clf every time, which essentially resets your model.
Also, in your code you compute the confusion matrix in each iteration but never store it anywhere; at the end you'll be left with the cm of just the last fold.
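A minimal sketch of the loop body with both suggestions applied, reusing the variable names from your code (cm_train_fold is a new name introduced here for the per-fold training matrix):
# inside the "for i in range(kfold)" loop, after clf.fit(X_train, y_train)
y_predict_train = clf.predict(X_train)                 # predictions on the training fold
cm_train_fold = confusion_matrix(y_train, y_predict_train)
y_predict_test = clf.predict(X_test)                   # predictions on the held-out fold
cm = confusion_matrix(y_test, y_predict_test)
conf_mat = conf_mat + cm                               # accumulate instead of discarding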
Related
I'm facing an issue where 3 different classifiers, all trained on the same dataset (the sklearn iris dataset), output the exact same accuracy scores and confusion matrices. I emailed my professor to ask whether this was normal and whether she had any advice if it wasn't, and all she gave me was basically "it's not normal, go back and look at your code".
I've done a fair bit of looking at my code since then, and I can't seem to see what's going on. I'm hoping someone on here will be able to shed some light on it for me and I'll be able to learn something from this experience.
Here is my code:
# Dataset
from sklearn import datasets
# Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Classifiers
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# Performance Metrics
from sklearn.metrics import confusion_matrix, accuracy_score
if __name__ == '__main__':
    # Read dataset into memory.
    iris = datasets.load_iris()
    # Extract independent and dependent variables.
    X = iris.data
    y = iris.target
    # Split training and test sets (70/30).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
    # Fit the scaler to the training set, and transform the feature columns of both the training
    # and test sets (all of them, since none of the features are categorical).
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    # Create the classifiers.
    dt_classifier = DecisionTreeClassifier(random_state=0)
    svm_classifier = SVC(kernel='rbf', random_state=0)
    lr_classifier = LogisticRegression(random_state=0)
    # Fit the classifiers to the training data.
    dt_classifier.fit(X_train, y_train)
    svm_classifier.fit(X_train, y_train)
    lr_classifier.fit(X_train, y_train)
    # Predict using the now trained classifiers.
    dt_y_pred = dt_classifier.predict(X_test)
    svm_y_pred = svm_classifier.predict(X_test)
    lr_y_pred = lr_classifier.predict(X_test)
    # Create confusion matrices using the predicted results and the actual results from the test set.
    dt_cm = confusion_matrix(y_test, dt_y_pred)
    svm_cm = confusion_matrix(y_test, svm_y_pred)
    lr_cm = confusion_matrix(y_test, lr_y_pred)
    # Calculate accuracy scores using the predicted results and the actual results from the test set.
    dt_score = accuracy_score(y_test, dt_y_pred)
    svm_score = accuracy_score(y_test, svm_y_pred)
    lr_score = accuracy_score(y_test, lr_y_pred)
    # Print confusion matrices and accuracy scores for each classifier.
    print('--- Decision Tree Classifier ---')
    print(f'Confusion Matrix:\n{dt_cm}')
    print(f'Accuracy Score:{dt_score}\n')
    print('--- Support Vector Machine Classifier ---')
    print(f'Confusion Matrix:\n{svm_cm}')
    print(f'Accuracy Score:{svm_score}\n')
    print('--- Logistic Regression Classifier ---')
    print(f'Confusion Matrix:\n{lr_cm}')
    print(f'Accuracy Score:{lr_score}')
Output:
--- Decision Tree Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
--- Support Vector Machine Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
--- Logistic Regression Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
As you can see, the outputs for each different classifier are exactly the same. Any sort of help that anyone could give me would be greatly appreciated.
There is nothing wrong with your code.
Such similarities in results are not unexpected when:
The data are rather "easy"
The sample is too small
Both these premises hold here. The iris data are notoriously easy to classify with modern ML algorithms (including the ones you use here); this, combined with the ridiculously small size of your test set (just 45 samples), makes such results unsurprising.
In fact, simply by changing your data split to use a test_size=0.20, you will get a perfect accuracy of 1.0 from all 3 models.
Nothing to worry about.
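If you want to see this for yourself, here is a small sketch (not part of your original code) that repeats the split over several seeds; the scores move with the split, not with the model:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = datasets.load_iris(return_X_y=True)
for seed in range(5):
    # a 45-sample test set: accuracy jumps around from seed to seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(seed, accuracy_score(y_te, clf.predict(X_te)))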
I am working on a simple KNN model with 3 nearest neighbours to predict a weight.
However, the accuracy is 0.0 and I don't know why.
The code does still give me a weight prediction for the 11th data point (58/59).
Here is the reproducible code:
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import accuracy_score
#Create df
data = {"ID": [i for i in range(1, 11)],
        "Height": [5, 5.11, 5.6, 5.9, 4.8, 5.8, 5.3, 5.8, 5.5, 5.6],
        "Age": [45, 26, 30, 34, 40, 36, 19, 28, 23, 32],
        "Weight": [77, 47, 55, 59, 72, 60, 40, 60, 45, 58]
        }
df = pd.DataFrame(data, columns=list(data.keys()))
print("This is the original df:")
print(df)
#Feature Engineering
df.drop(columns=["ID"], inplace=True) # passing axis positionally is deprecated
X = np.array(df.drop(columns=["Weight"]))
y = np.array(df["Weight"])
#define training and testing
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size =0.2)
#Build clf with n =3
clf = neighbors.KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
#accuracy
accuracy = clf.score(X_test, y_test)
print("\n accruacy = ", accuracy)
#Prediction on 11th
ans = np.array([5.5,38])
ans = ans.reshape(1,-1)
prediction = clf.predict(ans)
print("\nThis is the ans: ", prediction)
You are classifying Weight, which is a continuous (not a discrete) variable. This should be a regression rather than a classification. Try KNeighborsRegressor.
To evaluate your result, use metrics for regression, such as the R2 score.
If your score is low, that can mean different things: training set too small, test set too different from training set, regression model not adequate...
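A minimal sketch of the swap, reusing the variables from your code (only the estimator and the metric change):
import numpy as np
from sklearn import neighbors
from sklearn.metrics import r2_score

# KNeighborsRegressor averages the neighbours' weights instead of voting on an
# exact class label, so nearby-but-not-identical weights still count as close
reg = neighbors.KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)
print("R2 score =", r2_score(y_test, reg.predict(X_test)))
print("Prediction for the 11th point:", reg.predict(np.array([5.5, 38]).reshape(1, -1)))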
I am using the following Python code to make output predictions depending on some values, using decision trees based on the entropy/Gini index. My input data is contained in the file: https://drive.google.com/file/d/1C8GZ2wiqFUW3WuYxyc0G3axgkM1Uwsb6/view?usp=sharing
The first column "gold" in the file contains the output that I am trying to predict (either T or N). The remaining columns contain 0-or-1 data that I can use to predict the first column. I am using a test set of 30% and a training set of 70%. I am getting the same precision/recall using either entropy or the Gini index: a precision of 0.80 for T and a recall of 0.54 for T. I would like to increase the precision of T, and I am okay if the recall for T goes down as well; I am willing to accept this tradeoff. I do not care about the precision/recall of the N predictions; improving the precision of T is all I care about. I guess increasing the precision means that we should abstain from making predictions in some situations where we are not certain. How do I do that?
# Run this program on your local python
# interpreter, provided you have installed
# the required libraries.
# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split # sklearn.cross_validation is deprecated
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from io import StringIO # sklearn.externals.six is deprecated
from IPython.display import Image
from sklearn.tree import export_graphviz
from sklearn import tree
import collections
import pydotplus
# Function importing Dataset
def importdata():
    balance_data = pd.read_csv('data1extended.txt', sep=',')
    row_count, column_count = balance_data.shape
    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    print("Number of columns ", column_count)
    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data, column_count

def columns(balance_data):
    row_count, column_count = balance_data.shape
    return column_count
# Function to split the dataset
def splitdataset(balance_data, column_count):
    # Separating the target variable
    X = balance_data.values[:, 1:column_count]
    Y = balance_data.values[:, 0]
    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test
# Function to perform training with the Gini index.
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
                                      random_state=100, max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to perform training with entropy.
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy
# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred)*100)
    print("Report : ",
          classification_report(y_test, y_pred))
#Univariate selection
def selection(column_count, data):
    X = data.iloc[:, 1:column_count] #independent columns
    y = data.iloc[:, 0] #target column
    #apply SelectKBest class to extract the top 5 best features
    bestfeatures = SelectKBest(score_func=chi2, k=5)
    fit = bestfeatures.fit(X, y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)
    df = pd.DataFrame(data, columns=X.columns)
    #concat the two dataframes for better visualization
    featureScores = pd.concat([dfcolumns, dfscores], axis=1)
    featureScores.columns = ['Specs', 'Score'] #naming the dataframe columns
    print(featureScores.nlargest(5, 'Score')) #print the 5 best features
    return X, y, data, df
#Feature importance
def feature(X, y):
    model = ExtraTreesClassifier()
    model.fit(X, y)
    print(model.feature_importances_) #use the inbuilt feature_importances_ of tree-based classifiers
    #plot graph of feature importances for better visualization
    feat_importances = pd.Series(model.feature_importances_, index=X.columns)
    feat_importances.nlargest(5).plot(kind='barh')
    plt.show()

#Correlation Matrix
def correlation(data, column_count):
    corrmat = data.corr()
    top_corr_features = corrmat.index
    plt.figure(figsize=(column_count, column_count))
    #plot heat map
    g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")
def generate_decision_tree(X, y):
    clf = DecisionTreeClassifier(random_state=0)
    data_feature_names = ['callersAtLeast1T','CalleesAtLeast1T','callersAllT','calleesAllT','CallersAtLeast1N','CalleesAtLeast1N','CallersAllN','CalleesAllN','childrenAtLeast1T','parentsAtLeast1T','childrenAtLeast1N','parentsAtLeast1N','childrenAllT','parentsAllT','childrenAllN','ParentsAllN','ParametersatLeast1T','FieldMethodsAtLeast1T','ReturnTypeAtLeast1T','ParametersAtLeast1N','FieldMethodsAtLeast1N','ReturnTypeN','ParametersAllT','FieldMethodsAllT','ParametersAllN','FieldMethodsAllN']
    #generate model
    model = clf.fit(X, y)
    # Create DOT data
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=data_feature_names,
                                    class_names=y)
    # Draw graph
    graph = pydotplus.graph_from_dot_data(dot_data)
    # Show graph
    Image(graph.create_png())
    # Create PDF
    graph.write_pdf("tree.pdf")
    # Create PNG
    graph.write_png("tree.png")
# Driver code
def main():
    # Building Phase
    data, column_count = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data, column_count)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)
    # Operational Phase
    print("Results Using Gini Index:")
    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)
    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)
    #COMMENTED OUT THE 4 FOLLOWING LINES DUE TO MEMORY ERROR
    #X, y, dataheaders, df = selection(column_count, data)
    #generate_decision_tree(X, y)
    #feature(X, y)
    #correlation(dataheaders, column_count)

# Calling main function
if __name__ == "__main__":
    main()
I would suggest using Pipeline to build a data pipeline, and GridSearchCV to find the best possible hyper-parameters for the pipe.
A basic example:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('kbest', SelectKBest(chi2, k=3000)),
                 ('clf', DecisionTreeClassifier())
                 ])
pipe_params = {'kbest__k': range(1, 10, 1),
               'kbest__score_func': [f_classif, chi2],
               'clf__max_depth': np.arange(1, 30),
               'clf__min_samples_leaf': [1, 2, 4, 5, 10, 20, 30, 40, 80, 100]}
grid_search = GridSearchCV(pipe, pipe_params, n_jobs=-1,
                           scoring='accuracy', cv=10)
grid_search.fit(X_train, Y_train)
This will iterate over every hyper-parameter combination in pipe_params and choose the best model based on accuracy.
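Since the stated goal is the precision of T rather than overall accuracy, two further adjustments are worth sketching. This assumes the labels really are the strings 'T' and 'N' as described in the question, and the 0.8 threshold is purely hypothetical:
from sklearn.metrics import make_scorer, precision_score

# 1) make the grid search optimize precision of T instead of accuracy
t_precision = make_scorer(precision_score, pos_label='T')
grid_search = GridSearchCV(pipe, pipe_params, n_jobs=-1,
                           scoring=t_precision, cv=10)
grid_search.fit(X_train, Y_train)

# 2) abstain on uncertain cases: only predict "T" when the tree is confident,
# which trades recall for precision
proba = grid_search.best_estimator_.predict_proba(X_test)
t_index = list(grid_search.best_estimator_.classes_).index('T')
confident_T = proba[:, t_index] >= 0.8  # hypothetical confidence threshold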
I wrote a program that uses Support Vector Regression (SVR) to predict. First, I split the dataset into a train dataset and a test dataset, with a test ratio of 20% (case 1). Second, I used cross-validation, splitting the dataset into 5 groups (case 2). However, evaluating the two methods with the same metrics (R2, MAE, MSE), the results are quite different.
The program is as follows:
dataset = pd.read_csv('Dataset/allGlassStraightThroughTube.csv')
tube_par = dataset.iloc[:, 3:8].values
tube_eff = dataset.iloc[:, -1:].values
# # form train dataset , test dataset
tube_par_X_train, tube_par_X_test, tube_eff_Y_train, tube_eff_Y_test = train_test_split(tube_par, tube_eff, random_state=33, test_size=0.2)
# normalize the data
sc_X = StandardScaler()
sc_Y = StandardScaler()
sc_tube_par_X_train = sc_X.fit_transform(tube_par_X_train)
sc_tube_par_X_test = sc_X.transform(tube_par_X_test)
sc_tube_eff_Y_train = sc_Y.fit_transform(tube_eff_Y_train)
sc_tube_eff_Y_test = sc_Y.transform(tube_eff_Y_test)
# fit rbf SVR to the sc_tube_par_X dataset
support_vector_regressor = SVR(kernel='rbf')
support_vector_regressor.fit(sc_tube_par_X_train, sc_tube_eff_Y_train.ravel()) # SVR expects a 1-D target
#
# # predict new results from the sc_tube_par_X test set
pre_sc_tube_eff_Y_test = support_vector_regressor.predict(sc_tube_par_X_test)
pre_tube_eff_Y_test = sc_Y.inverse_transform(pre_sc_tube_eff_Y_test.reshape(-1, 1))
# calculate the prediction quality (in the original units of Y)
print('R2-score value rbf SVR')
print(r2_score(tube_eff_Y_test, pre_tube_eff_Y_test))
print('The mean squared error of rbf SVR is')
print(mean_squared_error(tube_eff_Y_test, pre_tube_eff_Y_test))
print('The mean absolute error of rbf SVR is')
print(mean_absolute_error(tube_eff_Y_test, pre_tube_eff_Y_test))
# normalize
sc_tube_par_X = sc_X.fit_transform(tube_par)
sc_tube_eff_Y = sc_Y.fit_transform(tube_eff)
scoring = ['r2','neg_mean_squared_error', 'neg_mean_absolute_error']
rbf_svr_regressor = SVR(kernel='rbf')
scores = cross_validate(rbf_svr_regressor, sc_tube_par_X, sc_tube_eff_Y.ravel(), cv=5, scoring=scoring, return_train_score=False)
In case 1, the evaluation metrics output is:
R2-score value rbf SVR
0.6486074476528559
The mean squared error of rbf SVR is
0.00013501023459497165
The mean absolute error of rbf SVR is
0.007196636233830076
In case 2, the evaluation metrics output is:
R2-score
0.2621779727614816
test_neg_mean_squared_error
-0.6497292887710239
test_neg_mean_absolute_error
-0.5629408849740231
The difference between case 1 and case 2 is big. Could you please tell me the reason, and how to correct it?
Bin.
I have prepared a little example to see how the results change using cross-validation. I recommend you try splitting the data without a seed and see how the results change.
You will see that the cross-validation results are almost constant, independently of the data split.
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,cross_validate
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
import matplotlib.pyplot as plt
def print_metrics(real_y, predicted_y):
    # calculate the prediction quality
    print('R2-score value {:>8.4f}'.format(r2_score(real_y, predicted_y)))
    print('Mean squared error is {:>8.4f}'.format(mean_squared_error(real_y, predicted_y)))
    print('Mean absolute error is {:>8.4f}\n\n'.format(mean_absolute_error(real_y, predicted_y)))

def show_plot(real_y, predicted_y):
    fig, ax = plt.subplots()
    ax.scatter(real_y, predicted_y, edgecolors=(0, 0, 0))
    ax.plot([real_y.min(), real_y.max()], [real_y.min(), real_y.max()], "k--", lw=4)
    ax.set_xlabel("Measured")
    ax.set_ylabel("Predicted")
    plt.show()
# dataset load
boston = datasets.load_boston()
#dataset info
# print(boston.keys())
# print(boston.DESCR)
# print(boston.data.shape)
# print(boston.feature_names)
# numpy_arrays
X = boston.data
Y = boston.target
# # form train dataset , test dataset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
#X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=33, test_size=0.2)
#X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=5, test_size=0.2)
# fit scalers
sc_X = StandardScaler().fit(X_train)
# standardize X (train and test)
X_train = sc_X.transform(X_train)
X_test = sc_X.transform(X_test)
######################################################################
############################### SVR ##################################
######################################################################
support_vector_regressor = SVR(kernel='rbf')
support_vector_regressor.fit(X_train, Y_train)
predicted_Y = support_vector_regressor.predict(X_test)
print_metrics(Y_test, predicted_Y)
show_plot(Y_test, predicted_Y)
######################################################################
########################### LINEAR REGRESSOR #########################
######################################################################
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
predicted_Y = lin_model.predict(X_test)
print_metrics(Y_test, predicted_Y)
show_plot(Y_test, predicted_Y)
######################################################################
######################### SVR + CROSS VALIDATION #####################
######################################################################
sc = StandardScaler().fit(X)
standardized_X = sc.transform(X)
scoring = ['r2','neg_mean_squared_error', 'neg_mean_absolute_error']
rbf_svr_regressor = SVR(kernel='rbf')
scores = cross_validate(rbf_svr_regressor, standardized_X, Y, cv=10, scoring=scoring, return_train_score=False)
print(scores["test_r2"].mean())
print(-1*(scores["test_neg_mean_squared_error"].mean()))
print(-1*(scores["test_neg_mean_absolute_error"].mean()))
Logistic regression with inputs from the "Machine Learning Data Set.csv" file:
#Import Libraries
import pandas as pd
#Import Dataset
dataset = pd.read_csv('Machine Learning Data Set.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 10]
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split # sklearn.cross_validation is deprecated
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
#Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
#Predicting the Test set results
y_pred = classifier.predict(X_test)
#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
I have machine learning / logistic regression code (Python) as above. It has properly trained my model and gives a really good match with the test data. But unfortunately, it only gives me 0/1 (binary) results when I test with some other random values (the training set has only 0/1, as in failed/succeeded).
How can I get a probability result instead of a binary result with this algorithm? I have tried very different sets of numbers and would like to find out the probability of failing, instead of a 0 or a 1.
Any help is strongly appreciated :) Thanks a lot!
Just replace
y_pred = classifier.predict(X_test)
with
y_pred = classifier.predict_proba(X_test)
For details, refer to Logistic Regression Probability.
predict_proba(X_test) will give you the probability of each sample for each class, i.e. if X_test contains n_samples and you have 2 classes, the output will be an "n_samples x 2" matrix, and the two predicted class probabilities for each sample will sum to 1. For more details, have a look at the documentation here.
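A minimal sketch of reading that output, assuming (as in your training data) that the failed class is labelled 1:
proba = classifier.predict_proba(X_test)   # shape: (n_samples, 2)
# column order follows classifier.classes_, e.g. array([0, 1])
fail_col = list(classifier.classes_).index(1)
p_fail = proba[:, fail_col]                # probability of failing, per sample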