Logistic Regression with inputs of "Machine Learning.csv" file.
#Import Libraries
import pandas as pd
#Import Dataset
dataset = pd.read_csv('Machine Learning Data Set.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 10]
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
#Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
#Predicting the Test set results
y_pred = classifier.predict(X_test)
#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
I have the machine learning / logistic regression code (Python) above. It trains my model properly and matches the test data very well. Unfortunately, it only gives me 0/1 (binary) results when I test it with other random values. (The training set contains only 0/1 labels, as in failed/succeeded.)
How can I get a probability instead of a binary result from this algorithm? I have tried very different sets of numbers and would like to find the probability of failing, instead of a 0 or 1.
Any help is strongly appreciated :) Thanks a lot!
Just replace
y_pred = classifier.predict(X_test)
with
y_pred = classifier.predict_proba(X_test)
For details, see the predict_proba section of the scikit-learn LogisticRegression documentation.
predict_proba(X_test) gives you the probability of each class for each sample. That is, if X_test contains n_samples rows and you have 2 classes, the output is an n_samples x 2 matrix, and the two probabilities in each row sum to 1. The columns follow the order of classifier.classes_.
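As a minimal sketch of the point above, the snippet below runs the same steps on synthetic data and pulls out the probability of failing. The synthetic data set from make_classification stands in for the CSV file, and treating label 0 as "failed" is an assumption based on the question's description of the 0/1 target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CSV data (assumption: 10 features, 0/1 target).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale with statistics learned from the training set only.
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1.
proba = classifier.predict_proba(X_test)

# Column order follows classifier.classes_, so look up the column of class 0
# (assumed here to mean "failed").
fail_col = list(classifier.classes_).index(0)
p_fail = proba[:, fail_col]

# The hard 0/1 labels are still available from predict() for the confusion matrix.
y_pred = classifier.predict(X_test)
print(p_fail[:5])
```

Keeping y_pred from predict() alongside the probabilities means the confusion_matrix call from the question continues to work, since it expects class labels rather than probabilities.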
I'm facing an issue where 3 different classifiers, all trained on the same dataset (the sklearn iris dataset), are outputting the exact same accuracy scores and confusion matrices. I've emailed my professor and asked if this was something normal and if she had any advice if it wasn't and all she gave me was basically "it's not normal, go back and look at your code".
I've done a fair bit of looking at my code since then, and I can't seem to see what's going on. I'm hoping someone on here will be able to shed some light on it for me and I'll be able to learn something from this experience.
Here is my code:
# Dataset
from sklearn import datasets
# Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Classifiers
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# Performance Metrics
from sklearn.metrics import confusion_matrix, accuracy_score
if __name__ == '__main__':
    # Read dataset into memory.
    iris = datasets.load_iris()
    # Extract independent and dependent variables into variables.
    X = iris.data
    y = iris.target
    # Split training and test sets (70/30).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
    # Fit the scaler to the training set, and transform the feature columns of both the training
    # and test sets, which are all of them since none of the features contain categorical data.
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    # Create the classifiers.
    dt_classifier = DecisionTreeClassifier(random_state=0)
    svm_classifier = SVC(kernel='rbf', random_state=0)
    lr_classifier = LogisticRegression(random_state=0)
    # Fit the classifiers to the training data.
    dt_classifier.fit(X_train, y_train)
    svm_classifier.fit(X_train, y_train)
    lr_classifier.fit(X_train, y_train)
    # Predict using the now trained classifiers.
    dt_y_pred = dt_classifier.predict(X_test)
    svm_y_pred = svm_classifier.predict(X_test)
    lr_y_pred = lr_classifier.predict(X_test)
    # Create confusion matrices using the predicted results and the actual results from the test set.
    dt_cm = confusion_matrix(y_test, dt_y_pred)
    svm_cm = confusion_matrix(y_test, svm_y_pred)
    lr_cm = confusion_matrix(y_test, lr_y_pred)
    # Calculate accuracy scores using the predicted results and the actual results from the test set.
    dt_score = accuracy_score(y_test, dt_y_pred)
    svm_score = accuracy_score(y_test, svm_y_pred)
    lr_score = accuracy_score(y_test, lr_y_pred)
    # Print confusion matrices and accuracy scores for each classifier.
    print('--- Decision Tree Classifier ---')
    print(f'Confusion Matrix:\n{dt_cm}')
    print(f'Accuracy Score:{dt_score}\n')
    print('--- Support Vector Machine Classifier ---')
    print(f'Confusion Matrix:\n{svm_cm}')
    print(f'Accuracy Score:{svm_score}\n')
    print('--- Logistic Regression Classifier ---')
    print(f'Confusion Matrix:\n{lr_cm}')
    print(f'Accuracy Score:{lr_score}')
Output:
--- Decision Tree Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
--- Support Vector Machine Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
--- Logistic Regression Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
As you can see, the outputs for each different classifier are exactly the same. Any sort of help that anyone could give me would be greatly appreciated.
There is nothing wrong with your code.
Such similarities in results are not unexpected when:
The data are rather "easy"
The sample is too small
Both these premises hold here. The iris data are notoriously easy to classify with modern ML algorithms (including the ones you use here); this, combined with the very small size of your test set (just 45 samples), makes such results unsurprising.
In fact, simply by changing your data split to use a test_size=0.20, you will get a perfect accuracy of 1.0 from all 3 models.
Nothing to worry about.
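One way to look past a single 45-sample test set is to score each model over several folds. The sketch below is only an illustration of that idea, not part of the original answer; the choice of 10 stratified folds and the use of make_pipeline to keep scaling inside each fold are assumptions.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# Assumption: 10 stratified, shuffled folds; any reasonable k would do.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'SVM (rbf)': SVC(kernel='rbf', random_state=0),
    'Logistic Regression': LogisticRegression(random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # The scaler is refit on the training part of every fold.
    pipeline = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')
    print(f'{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}')
```

Per-fold scores make small differences between the models visible where they exist, although on a data set this easy the means may still end up very close.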
I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data, and f1_score for my evaluation metric. The strange thing is that I'm noticing my model giving me different results in a pattern at each run.
data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column being the target value. Columns 1, 3, and 5 are categorical data.
The following code is what I did to preprocess and format my data:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")
X = data.iloc[:, :-1].copy()  # copy so the encoding below does not write into a view of data
y = data.iloc[:, 6]
# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(X.select_dtypes(include='object').columns)
for column in categorical_data:
    X[column] = labelenc.fit_transform(X[column])
# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.transform(X_val)  # reuse the scaling fitted on the training set
The next code is for the actual decision tree model training:
dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')
print("Score is = {}".format(score))
The output that I get (i.e. the score) varies, but in a pattern. For example, it fluctuates between 0.39 and 0.42.
On some iterations, I even get the UndefinedMetricWarning, that claims "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
I'm familiar with what the UndefinedMetricWarning means, after doing some searching on this community and Google. I guess the two questions I have may be organized to:
Why does my output vary for each iteration? Is there something in the preprocessing stage that happens which I'm not aware of?
I've also tried to use the F-score with other data splits, but I always get the warning. Is this unpreventable?
Thank you.
You are splitting the dataset into training and test sets, and train_test_split divides the data randomly on each run. Because the model is trained on different training data and evaluated on different test data every time, the F-score varies from run to run depending on how well the model is trained.
To reproduce the same result on each run, use the random_state parameter. It seeds the random number generator so that the same sequence of random numbers, and therefore the same split, is produced every time. The seed can be any integer.
#train test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
#Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
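The effect of the two seeds can be checked with a small experiment. This is a sketch on synthetic data: make_classification stands in for the real train.csv, and the helper run_once is a hypothetical name introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (2000, 7) data set: 6 features, 3 classes (assumed).
X, y = make_classification(n_samples=2000, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

def run_once(split_seed=None, tree_seed=None):
    """Split, train and score once; a seed of None means 'different every run'."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=split_seed)
    dectree = DecisionTreeClassifier(class_weight='balanced', random_state=tree_seed)
    dectree.fit(X_train, y_train)
    return f1_score(y_val, dectree.predict(X_val), average='macro')

# With both seeds fixed, every run gives the same score.
seeded = [run_once(split_seed=13, tree_seed=2018) for _ in range(3)]
# Without seeds, the split (and tie-breaking inside the tree) changes between runs.
unseeded = [run_once() for _ in range(3)]

print('seeded:  ', seeded)
print('unseeded:', unseeded)
```

Fixing the seeds only makes a run repeatable; it does not say which score is representative, so the spread of the unseeded scores is still worth looking at.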
I've trained and tested my logistic regression using available data but now need to output a future prediction. I want to include the 2017 values that I used in my training and test set to predict the 2018 probability.
This is the code I used to train and test my model:
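import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
# .ix was removed in pandas 1.0; select columns by label with .loc and a list
Xadj = train.loc[:, ['2016 transaction count', 'critical_CI', 'critical_CN', 'critical_CS',
                     'critical_FI', 'critical_IN', 'critical_OI', 'critical_RA', 'create_year_2012', 'create_year_2013',
                     'create_year_2014', 'create_year_2015', 'create_year_2016']]
#Coded is the transformation of 2017 transaction count to a binary variable
y = train.loc[:, '2017 transaction count coded']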
logit_model=sm.Logit(y,Xadj)
result=logit_model.fit()
print(result.summary())
X_train, X_test, y_train, y_test = train_test_split(Xadj, y, test_size=0.3, random_state=42)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
#Cross Validation
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)  # random_state only takes effect with shuffle=True
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))
In an attempt to export predictions for 2018, I have done the following:
#Create 2018 Purchase Probability
train['2018 Purchase Probability']=pd.DataFrame({'2018 Purchase Probability' : []})
yact = train.loc[:, '2018 Purchase Probability']
#Adding in 2017 values
X = train.loc[:, ['2017 transaction count', 'critical_CI', 'critical_CN', 'critical_CS',
                  'critical_FI', 'critical_IN', 'critical_OI', 'critical_RA', 'create_year_2012', 'create_year_2013',
                  'create_year_2014', 'create_year_2015', 'create_year_2016', 'create_year_2017']]
from sklearn.preprocessing import scale, StandardScaler
scaler = StandardScaler()
scaler.fit(Xadj)
X = scaler.transform(Xadj)
X_pred = scaler.transform(X)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(Xadj, y)
#Generate 0/1 prediction
prediction = logreg.predict(X= X)
#Generate class probabilities (predict_proba returns probabilities, not odds ratios)
percent_prediction = logreg.predict_proba(X= X)
prediction = pd.DataFrame(prediction)
I'm not sure if I've done this correctly and judging from my output (which is mostly 1's) I don't think I have. I am new to coding in Python and am struggling to turn my tested model into a future prediction that can be used to make decisions.
Thanks in advance for any help!
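A few things in the prediction code above stand out: X is overwritten with the scaled 2016 features, the model is fit on unscaled Xadj but asked to predict on scaled values, and the 2017 feature list has one more column than the list used for training. The sketch below shows one possible shape for the workflow, under the assumption that the 2018 features should have exactly the same column layout as the training features, with the 2017 count in place of the 2016 count. The data frame and its shortened column names (count_2016, count_2017, bought_2017) are hypothetical stand-ins, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Hypothetical stand-in for the real `train` frame.
train = pd.DataFrame({
    'count_2016': rng.poisson(3, n),
    'count_2017': rng.poisson(3, n),
    'critical_CI': rng.integers(0, 2, n),
})
train['bought_2017'] = (train['count_2017'] > 2).astype(int)

# Fit: previous year's count plus static flags -> next year's 0/1 outcome.
# to_numpy() drops column names, since the yearly columns are named differently.
X_fit = train[['count_2016', 'critical_CI']].to_numpy()
y_fit = train['bought_2017'].to_numpy()

# The pipeline applies the same scaling at fit time and at prediction time.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_fit, y_fit)

# Predict: same column layout, shifted one year forward.
X_future = train[['count_2017', 'critical_CI']].to_numpy()
classes = model.named_steps['logisticregression'].classes_
positive_col = list(classes).index(1)
train['2018 Purchase Probability'] = model.predict_proba(X_future)[:, positive_col]

print(train[['count_2017', '2018 Purchase Probability']].head())
```

Whether a model trained on the 2016 to 2017 relationship carries over to 2018 is a modelling assumption that the code itself cannot verify.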
I am a newbie in machine learning. I recently learnt how to calculate the confusion_matrix for the test set of a KNN classification, but I do not know how to calculate it for the training set.
How can I compute the confusion_matrix for the training set of the KNN classification in the following code?
The following code computes the confusion_matrix for the test set:
# Split test and train data
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array(dataset.iloc[:, 1:10])  # .ix was removed in pandas 1.0; .iloc selects by position
y = np.array(dataset['benign_malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#Define Classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, y_train)
# Predicting the Test set results
y_pred = knn.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) # Calculate confusion matrix for test set.
For k-fold cross-validation:
I am also trying to find the confusion_matrix for the training set using k-fold cross-validation.
I am confused by the line knn.fit(X_train, y_train).
Do I need to change it?
Where should I change the following code to compute the confusion_matrix for the training set?
# Applying k-fold Method
from sklearn.model_selection import StratifiedKFold  # sklearn.cross_validation was removed in scikit-learn 0.20
kfold = 10 # no. of folds (better to have this at the start of the code)
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)
# Stratified KFold: This first divides the data into k folds. Then it also makes sure that the distribution of the data in each fold follows the original input distribution
# Each element of skfind is a (train indices, test indices) pair
skfind = list(skf.split(X, y))
# skfind[i][0] -> train indices, skfind[i][1] -> test indices
# Supervised Classification with k-fold Cross Validation
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
conf_mat = np.zeros((2,2)) # Initializing the Confusion Matrix
n_neighbors = 1; # better to have this at the start of the code
# 10-fold Cross Validation
for i in range(kfold):
    train_indices = skfind[i][0]
    test_indices = skfind[i][1]
    clf = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    # fit Training set
    clf.fit(X_train,y_train)
    # predict Test data
    y_predict_test = clf.predict(X_test) # output is labels and not indices
    # Compute confusion matrix
    cm = confusion_matrix(y_test,y_predict_test)
    print(cm)
    # conf_mat = conf_mat + cm
You don't have to make many changes:
# Predicting the train set results
y_train_pred = knn.predict(X_train)
cm_train = confusion_matrix(y_train, y_train_pred)
Here, instead of X_test, we pass X_train to the classifier and then build the confusion matrix from the predicted classes for the training dataset and the actual classes.
The idea behind a confusion matrix is essentially to count the predictions falling into four categories (if y is binary):
predicted True but actually false
predicted True and actually True
predicted False but actually True
predicted False and actually False
So as long as you have two sets, predicted and actual, you can create the confusion matrix. All you have to do is predict the classes and compare them with the actual classes.
EDIT
In the cross-validation part, you can add the line y_predict_train = clf.predict(X_train) to calculate the training confusion matrix in each iteration. You can do this because the loop creates a new clf every time, which effectively resets your model.
Also, your code computes the confusion matrix in each iteration but does not store it anywhere. At the end you are left with the cm of only the last test set.
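Both suggestions, predicting on the training fold and keeping the matrices instead of overwriting them, can be combined as in the sketch below. It uses scikit-learn's built-in breast cancer data as a stand-in for the question's dataset, and the names conf_mat_train and conf_mat_test are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Stand-in binary dataset (assumption; the question's `dataset` is not available).
X, y = load_breast_cancer(return_X_y=True)
labels = np.unique(y)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Running totals over all folds.
conf_mat_train = np.zeros((2, 2), dtype=np.int64)
conf_mat_test = np.zeros((2, 2), dtype=np.int64)

for train_indices, test_indices in skf.split(X, y):
    # A fresh classifier per fold, so nothing carries over between folds.
    clf = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
    clf.fit(X[train_indices], y[train_indices])

    # Training-set confusion matrix: predict on the data the model was fit on.
    y_predict_train = clf.predict(X[train_indices])
    conf_mat_train += confusion_matrix(y[train_indices], y_predict_train, labels=labels)

    # Test-set confusion matrix for the held-out fold.
    y_predict_test = clf.predict(X[test_indices])
    conf_mat_test += confusion_matrix(y[test_indices], y_predict_test, labels=labels)

print('Training folds:\n', conf_mat_train)
print('Test folds:\n', conf_mat_test)
```

Passing labels explicitly keeps every per-fold matrix 2 x 2 even if a fold happens to contain predictions for only one class, so the sums always line up.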
I'm working on an example of applying a Restricted Boltzmann Machine to the Iris dataset. Essentially, I'm trying to make a comparison between RBM and LDA. LDA seems to produce reasonably correct output, but the RBM doesn't. Following a suggestion, I binarized the feature inputs using sklearn.preprocessing.Binarizer, and also tried different threshold parameter values. I tried several different ways to apply binarization, but none seemed to work for me.
Below is my modified version of the code, based on a version by user covariance.
Any helpful comments are greatly appreciated.
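from sklearn import linear_model, datasets, preprocessing
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA  # replaces the removed sklearn.lda module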
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:,:2] # we only take the first two features.
Y = iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)
# Models we will use
rbm = BernoulliRBM(random_state=0, verbose=True)
binarizer = preprocessing.Binarizer(threshold=0.01,copy=True)
X_binarized = binarizer.fit_transform(X_train)
hidden_layer = rbm.fit_transform(X_binarized, Y_train)
logistic = linear_model.LogisticRegression()
logistic.coef_ = hidden_layer
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
lda = LDA(n_components=2)  # at most min(n_features, n_classes - 1) = 2 components are possible here
#########################################################################
# Training RBM-Logistic Pipeline
logistic.fit(X_train, Y_train)
classifier.fit(X_binarized, Y_train)
#########################################################################
# Get predictions
print("The RBM model:")
print("Predict: ", classifier.predict(X_test))
print("Real: ", Y_test)
print()
print("Linear Discriminant Analysis: ")
lda.fit(X_train, Y_train)
print("Predict: ", lda.predict(X_test))
print("Real: ", Y_test)
RBM and LDA are not directly comparable, as RBM doesn't perform classification on its own. Though you are using it as a feature engineering step with logistic regression at the end, LDA is itself a classifier - so the comparison isn't very meaningful.
The BernoulliRBM in scikit learn only handles binary inputs. The iris dataset has no sensible binarization, so you aren't going to get any meaningful outputs.
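The question's code also binarizes the training data outside the pipeline and then passes the raw X_test to predict, so the two sets are not preprocessed the same way. The sketch below puts the preprocessing inside the Pipeline so that both sets go through identical steps. It swaps the Binarizer for MinMaxScaler, relying on the scikit-learn user guide's statement that BernoulliRBM inputs may also be values between 0 and 1 read as probabilities; that substitution is an assumption of this sketch, and it is not expected to make the RBM features a good fit for iris.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

iris = datasets.load_iris()
X = iris.data[:, :2]  # first two features, as in the question
Y = iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)

classifier = Pipeline(steps=[
    # Rescale to [0, 1]; clip=True keeps test values inside that range.
    ('scale', MinMaxScaler(clip=True)),
    # Unsupervised feature extraction; the labels are not used by this step.
    ('rbm', BernoulliRBM(random_state=0)),
    # The actual classifier, trained on the RBM's hidden-unit activations.
    ('logistic', LogisticRegression(max_iter=1000)),
])

# fit() trains every step on the training data; predict() reuses the fitted steps.
classifier.fit(X_train, Y_train)
rbm_pred = classifier.predict(X_test)

print("Predict: ", rbm_pred)
print("Real: ", Y_test)
```

Because the logistic regression is fitted as the last pipeline step, there is no need to fit it separately or to assign its coef_ by hand.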