Probabilities for multiclass problem using OneVsRestClassifier do not sum to 1 - python

I have a multiclass classification problem for various classifiers (random forest, SVM, NN) and I use OneVsRestClassifier to wrap my models. I want to use an interpretability method (LIME) which requires probabilities that sum to 1, but when I call predict_proba, the rows of the returned matrix do not always sum to 1.
It's a multiclass classification problem. I have checked my raw data, my binarized values, and my train/test data to check that there is no overlap of classes. Each instance has a distinct label (100, 010, or 001).
import pandas as pd
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

x = pd.read_pickle(r"x.pkl").values
y = pd.read_pickle(r"y.pkl").values
# binarize labels for multilabel auc calculations
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
# create train and test sets, stratified
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size = 0.20, random_state=5)
rfclassifier = RandomForestClassifier(n_estimators=100, random_state=5, criterion = 'gini', bootstrap = True)
classifier = OneVsRestClassifier(rfclassifier)
classifier.fit(x_train, y_train)
prediction = classifier.predict(x_test)
probability = classifier.predict_proba(x_test)
#check probabilities
print(classifier.predict_proba([x_test[0]]).round(3))
print(classifier.predict_proba([x_test[1]]).round(3))
print(classifier.predict_proba([x_test[20]]).round(3))
The print statements show examples for label 1, 0, and 2 respectively.
The outputs are [[0.164 0.836 0. ]], [[0.953 0.015 0. ]], and [[0.01 0.12 0.96]]. The last two (as well as many other instances) do not sum to 1, which prevents me from applying the interpretability method.
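A likely explanation is that OneVsRestClassifier fits one independent binary classifier per class, and because the target was passed through label_binarize it is treated as a multilabel problem, so the per-class probabilities are not normalized to sum to 1. A minimal sketch of a common workaround, assuming the classifier and x_test defined above, is to renormalize each row before handing it to LIME:
import numpy as np

proba = classifier.predict_proba(x_test)
proba = proba / proba.sum(axis=1, keepdims=True)  # each row now sums to 1
print(proba[[0, 1, 20]].round(3))
Passing the original 1-D class labels to the wrapper (rather than the binarized matrix) should also yield rows that sum to 1.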

Related

How to implement Gaussian Naive Bayes in two training sets

How am I supposed to implement Gaussian Naive Bayes using two sets (training and test)?
I need:
Create a training set by selecting the rows with id <= 160
Train a Gaussian Naive-Bayes classifier as we saw in class to determine if a campaign will be successful, given the amounts used in each marketing channel
Calculate the fraction of the training set that is correctly classified.
and:
Create a test set by selecting the rows with id > 160
Evaluate the performance of the classifier as follows:
What percentage of the test set was classified correctly (correct answers out of the total)? It is desirable that this number reaches at least 80%
What is the ratio of false positives to false negatives?
Successful marketing campaign:
successful_marketing_campaign = (dataset['sales'] > 15) | (dataset['total_invested'] < 20)
And my code:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

X = dataset.iloc[:, [0, 3]].values.astype('int')
y = dataset.iloc[:, 4].values.astype('int')
# manual split by position (intended: id <= 160 for training, id > 160 for testing)
X_train = dataset.iloc[0:160, [0, 3]].values.astype('int')
y_train = dataset.iloc[0:160, 4].values.astype('int')
X_test = dataset.iloc[160:, [0, 3]].values.astype('int')
y_test = dataset.iloc[160:, 4].values.astype('int')
# note: this random split overwrites the manual split above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
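A minimal sketch of the id-based split the exercise describes, assuming dataset has an id column (an assumption about the data layout) and reusing the label definition and feature columns from the question:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# label from the question's definition of a successful campaign
y = ((dataset['sales'] > 15) | (dataset['total_invested'] < 20)).astype(int)
X = dataset.iloc[:, [0, 3]]  # same two feature columns as in the question

train_mask = dataset['id'] <= 160  # assumes an 'id' column, per the exercise
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]

clf = GaussianNB()
clf.fit(X_train, y_train)
print('train accuracy:', clf.score(X_train, y_train))
print('test accuracy:', clf.score(X_test, y_test))

tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
print('false positives / false negatives:', fp / fn if fn else float('inf'))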

How to tell a SciKit LinearRegression model that a predicted value cannot be less than Zero?

I have the following code that attempts to value stocks based on non-price features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

price = df.loc[:,'regularMarketPrice']
features = df.loc[:,feature_list]
#
X_train, X_test, y_train, y_test = train_test_split(features, price, test_size = 0.15, random_state = 1)
if len(X_train.shape) < 2:
    X_train = np.array(X_train).reshape(-1,1)
    X_test = np.array(X_test).reshape(-1,1)
#
model = LinearRegression()
model.fit(X_train,y_train)
#
print('Train Score:', model.score(X_train,y_train))
print('Test Score:', model.score(X_test,y_test))
#
y_predicted = model.predict(X_test)
In my df (which is very large), there is never an instance where 'regularMarketPrice' is less than 0. However, I occasionally receive a value less than 0 for some points in y_predicted.
Is there a way in Scikit to say anything less than 0 is an invalid prediction? I am hoping this makes my model more accurate.
Please comment if there is a need for further explanation.
To make predictions larger than 0, you should not use plain linear regression. Consider a generalized linear model (GLM) instead, such as Poisson regression.
from sklearn.linear_model import PoissonRegressor
price = df.loc[:,'regularMarketPrice']
features = df.loc[:,feature_list]
#
X_train, X_test, y_train, y_test = train_test_split(features, price, test_size = 0.15, random_state = 1)
if len(X_train.shape) < 2:
    X_train = np.array(X_train).reshape(-1,1)
    X_test = np.array(X_test).reshape(-1,1)
#
model = PoissonRegressor()
model.fit(X_train,y_train)
#
print('Train Score:', model.score(X_train,y_train))
print('Test Score:', model.score(X_test,y_test))
#
y_predicted = model.predict(X_test)
All predictions are greater than or equal to 0.
Consider using something other than a Gaussian response variable. Plot your y-values with a histogram. If the data are right-skewed, consider modeling with a GLM using a gamma distribution and a log link.
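A minimal sketch of that suggestion, assuming the X_train/y_train split from the code above (scikit-learn's GammaRegressor uses a log link and requires strictly positive targets, which holds for prices):
from sklearn.linear_model import GammaRegressor

gamma_model = GammaRegressor()  # gamma GLM with log link; predictions are strictly positive
gamma_model.fit(X_train, y_train)
y_predicted = gamma_model.predict(X_test)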
Alternatively, you could clip the predictions, setting each predicted value to the maximum of the model's prediction and 0.
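A minimal sketch of that clipping approach, assuming the model and X_test from the LinearRegression code above:
import numpy as np

# post-hoc fix: clip negative predictions to zero; the model itself is unchanged
y_predicted = np.maximum(model.predict(X_test), 0)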

How to do an evaluation of Logistic Regression with imbalanced dataset using sklearn?

I am running a logistic regression using Python scikit-learn. I have an imbalanced dataset, with 2/3 of the datapoints having label y=0 and 1/3 having label y=1.
I do a stratified splitting:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, stratify=y)
My grid for the hyperparameter search is:
grid = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
Then I do a grid search including class_weight='balanced':
grid_search = GridSearchCV(
    estimator=LogisticRegression(
        max_iter=200,
        random_state=1111111111,
        class_weight='balanced',
        multi_class='auto',
        fit_intercept=True
    ),
    param_grid=grid,
    scoring=score,
    cv=5,
    refit=True
)
My first question is regarding the score. This is the method GridSearchCV uses to choose the "best" classifier, i.e. to find the best hyperparameters. Since I fit the LogisticRegression with class_weight='balanced', should I use the classic score='accuracy', or do I still need to use score='balanced_accuracy'? And why?
So I go on and find the best classifier:
best_clf = grid_search.fit(X_train, y_train)
y_pred = best_clf.predict(X_test)
And now I want to calculate evaluation metrics, for example also the accuracy (again) and the f1-score.
Second question: Do I here need to use the "normal" accuracy/f1 or the balanced/weighted accuracy/f1?
"Normal":
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, pos_label=1, average='binary')
Or balanced/weighted:
acc_weighted = balanced_accuracy_score(y_test, y_pred, sample_weight=y_weights)
f1_weighted = f1_score(y_test, y_pred, sample_weight=y_weights, average='weighted')
If I should be using the balanced/weighted version, my third question concerns the parameter sample_weight=y_weights. How should I set the weights? To achieve a balance (although, as I said, I am not sure whether setting class_weight='balanced' already achieves one), I should scale label y=0 with 1/3 and y=1 with 2/3, right? Like this:
y_weights = [x*(1/3)+(1/3) for x in y_test]
Or should I enter here the real distribution and scale label y=0 with 2/3 and label y=1 with 1/3? Like this:
y_weights = [x*(-1/3)+(2/3) for x in y_test]
My final question is: For evaluation, what would be the baseline accuracy that I compare my accuracy to?
0.33 (class 1), 0.5 (after balancing), or 0.66 (class 0)?
Edit: By baseline I mean a model that naively classifies all data as "1", or a model that classifies all data as "0". A problem is that I don't know whether I can choose freely. For example, say I get an accuracy or a balanced_accuracy of 0.66. If I compare with the baseline "always 1" (accuracy 0.33?), my model is better. If I compare with the baseline "always 0" (accuracy 0.66?), my model is worse.
Thank you all very much for helping me.
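For reference, one common way to compute such naive baselines (this is a sketch, not part of the question's code) is scikit-learn's DummyClassifier, assuming the X_train/X_test split from above:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

baseline = DummyClassifier(strategy='most_frequent')  # always predicts the majority class (y=0 here)
baseline.fit(X_train, y_train)
y_base = baseline.predict(X_test)

print('baseline accuracy:', accuracy_score(y_test, y_base))                    # about 0.66 for a 2:1 split
print('baseline balanced accuracy:', balanced_accuracy_score(y_test, y_base))  # 0.5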

Calculate confusion_matrix for Training set

I am a newbie in machine learning. I recently learnt how to calculate a confusion_matrix for the test set of a KNN classification, but I do not know how to calculate the confusion_matrix for the training set.
How can I compute the confusion_matrix for the training set from the following code?
The following code computes the confusion_matrix for the test set:
# Split test and train data
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array(dataset.iloc[:, 1:10])
y = np.array(dataset['benign_malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#Define Classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, y_train)
# Predicting the Test set results
y_pred = knn.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) # Calulate Confusion matrix for test set.
For k-fold cross-validation:
I am also trying to find the confusion_matrix for the training set using k-fold cross-validation.
I am confused by the line knn.fit(X_train, y_train).
Should I change this line?
Where in the following code should I make changes to compute the confusion_matrix for the training set?
# Applying k-fold Method
from sklearn.cross_validation import StratifiedKFold
kfold = 10 # no. of folds (better to have this at the start of the code)
skf = StratifiedKFold(y, kfold, random_state = 0)
# Stratified KFold: This first divides the data into k folds. Then it also makes sure that the distribution of the data in each fold follows the original input distribution
# Note: in future versions of scikit.learn, this module will be fused with kfold
skfind = [None]*len(skf) # indices
cnt=0
for train_index in skf:
    skfind[cnt] = train_index
    cnt = cnt + 1
# skfind[i][0] -> train indices, skfind[i][1] -> test indices
# Supervised Classification with k-fold Cross Validation
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
conf_mat = np.zeros((2,2)) # Initializing the Confusion Matrix
n_neighbors = 1; # better to have this at the start of the code
# 10-fold Cross Validation
for i in range(kfold):
    train_indices = skfind[i][0]
    test_indices = skfind[i][1]
    clf = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    # fit Training set
    clf.fit(X_train,y_train)
    # predict Test data
    y_predict_test = clf.predict(X_test) # output is labels and not indices
    # Compute confusion matrix
    cm = confusion_matrix(y_test,y_predict_test)
    print(cm)
    # conf_mat = conf_mat + cm
You don't have to make many changes:
# Predicting the train set results
y_train_pred = knn.predict(X_train)
cm_train = confusion_matrix(y_train, y_train_pred)
Here, instead of using X_test, we use X_train for prediction, and then we produce a confusion matrix from the predicted classes for the training dataset and the actual classes.
The idea behind a confusion matrix is essentially to count the number of classifications falling into four categories (if y is binary):
predicted True but actually false
predicted True and actually True
predicted False but actually True
predicted False and actually False
So as long as you have the two sets, predicted and actual, you can create the confusion matrix. All you have to do is predict the classes and use the actual classes to get the confusion matrix.
EDIT
In the cross-validation part, you can add a line y_predict_train = clf.predict(X_train) to calculate the confusion matrix for each iteration. You can do this because in the loop you initialize the clf every time, which basically means resetting your model.
Also, in your code you compute the confusion matrix each time but you don't store it anywhere. At the end you'll be left with the cm of just the last fold.
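A minimal sketch of both suggestions combined, using the current StratifiedKFold API instead of the deprecated sklearn.cross_validation import, and accumulating the per-fold matrices (assumes the X and y arrays from the question):
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
conf_mat_train = np.zeros((2, 2))  # summed over all training folds
conf_mat_test = np.zeros((2, 2))   # summed over all test folds

for train_indices, test_indices in skf.split(X, y):
    clf = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
    clf.fit(X[train_indices], y[train_indices])
    # training-set confusion matrix for this fold
    conf_mat_train += confusion_matrix(y[train_indices], clf.predict(X[train_indices]))
    # test-set confusion matrix for this fold
    conf_mat_test += confusion_matrix(y[test_indices], clf.predict(X[test_indices]))

print(conf_mat_train)
print(conf_mat_test)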

How to use k-fold cross-validation in scikit with a naive Bayes classifier and NLTK

I have a small corpus and I want to calculate the accuracy of a naive Bayes classifier using 10-fold cross-validation. How can I do it?
Your options are to either set this up yourself or use something like NLTK-Trainer since NLTK doesn't directly support cross-validation for machine learning algorithms.
I'd recommend probably just using another module to do this for you but if you really want to write your own code you could do something like the following.
Supposing you want 10-fold, you would have to partition your training set into 10 subsets, train on 9/10, test on the remaining 1/10, and do this for each combination of subsets (10).
Assuming your training set is in a list named training, a simple way to accomplish this would be,
num_folds = 10
subset_size = len(training) // num_folds  # integer division (use // in Python 3)
for i in range(num_folds):
    testing_this_round = training[i*subset_size:][:subset_size]
    training_this_round = training[:i*subset_size] + training[(i+1)*subset_size:]
    # train using training_this_round
    # evaluate against testing_this_round
    # save accuracy
# find mean accuracy over all rounds
Actually there is no need for the long loop provided in the most upvoted answer. Also, the choice of classifier is irrelevant (it can be any classifier).
Scikit provides cross_val_score, which does all the looping under the hood.
from sklearn.model_selection import KFold, cross_val_score
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
clf = <any classifier>
print(cross_val_score(clf, X, y, cv=k_fold, n_jobs=1))
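For example, with a concrete (illustrative) classifier and the X/y arrays from your own data:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()  # any scikit-learn classifier would work here
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=k_fold, n_jobs=1)
print(scores.mean())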
I've used both libraries: NLTK for the naive Bayes classifier and sklearn for cross-validation, as follows:
import nltk
from sklearn.model_selection import KFold

training_set = nltk.classify.apply_features(extract_features, documents)
cv = KFold(n_splits=10, shuffle=False)
for traincv, testcv in cv.split(training_set):
    classifier = nltk.NaiveBayesClassifier.train(training_set[traincv[0]:traincv[len(traincv)-1]])
    print('accuracy:', nltk.classify.util.accuracy(classifier, training_set[testcv[0]:testcv[len(testcv)-1]]))
and at the end I calculated the average accuracy
Modified the second answer:
cv = KFold(n_splits=10, shuffle=True, random_state=None)
Inspired by Jared's answer, here is a version using a generator:
def k_fold_generator(X, y, k_fold):
    subset_size = len(X) // k_fold  # integer division (needed for slicing in Python 3)
    for k in range(k_fold):
        X_train = X[:k * subset_size] + X[(k + 1) * subset_size:]
        X_valid = X[k * subset_size:][:subset_size]
        y_train = y[:k * subset_size] + y[(k + 1) * subset_size:]
        y_valid = y[k * subset_size:][:subset_size]
        yield X_train, y_train, X_valid, y_valid
I am assuming that your data set X has N data points (= 4 in the example) and D features (= 2 in the example). The associated N labels are stored in y.
X = [[ 1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 0, 1, 1]
k_fold = 2
for X_train, y_train, X_valid, y_valid in k_fold_generator(X, y, k_fold):
    # Train using X_train and y_train
    # Evaluate using X_valid and y_valid
