How to back check classification using sklearn

How to back check classification using sklearn - python

I am running two different classification algorithms on my data logistic regression and naive bayes but it is giving me same accuracy even if I change the training and testing data ratio. Following is the code I am using
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
df = pd.read_csv('Speed Dating.csv', encoding = 'latin-1')
X = pd.DataFrame()
X['d_age'] = df ['d_age']
X['match'] = df ['match']
X['importance_same_religion'] = df ['importance_same_religion']
X['importance_same_race'] = df ['importance_same_race']
X['diff_partner_rating'] = df ['diff_partner_rating']
# Drop NAs
X = X.dropna(axis=0)
# Categorical variable Match [Yes, No]
y = X['match']
# Drop y from X
X = X.drop(['match'], axis=1)
# Transformation
scalar = StandardScaler()
X = scalar.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression
model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)
print('Accuracy Score with Logistic Regression: ', accuracy_score(y_test, model.predict(X_test)))
#Naive Bayes
model_2 = GaussianNB()
model_2.fit(X_train, y_train)
print('Accuracy Score with Naive Bayes: ', accuracy_score(y_test, model_2.predict(X_test)))
print(model_2.predict(X_test))
Is it possible that every time the accuracy is same ?

This is common phenomena occurring if the class frequencies are unbalanced, e.g. nearly all samples belong to one class. For examples if 80% of your samples belong to class "No", then classifier will often tend to predict "No" because such a trivial prediction reaches the highest overall accuracy on your train set.
In general, when evaluating the performance of a binary classifier, you should not only look at the overall accuracy. You have to consider other metrics such as the ROC Curve, class accuracies, f1 scores and so on.
In your case you can use sklearns classification report to get a better feeling what your classifier is actually learning:
from sklearn.metrics import classification_report
print(classification_report(y_test, model_1.predict(X_test)))
print(classification_report(y_test, model_2.predict(X_test)))
It will print the precision, recall and accuracy for every class.
There are three options on how to reach a better classification accuracy on your class "Yes"
use sample weights, you can increase the importance of the samples of the "Yes" class thus forcing the classifier to predict "Yes" more often
downsample the "No" class in the original X to reach more balanced class frequencies
upsample the "Yes" class in the original X to reach more balanced class frequencies

Related

Compute cross validation for multi-variate linear regression

I am training different models for a regression problem. Since i want to find the best model between the choices, i wanted to perform a cross validation with k = 20, to characterize the MSE of the models, and statistically determine what model is the better between them.
The problem has got multiple dependant variables, and i would like to determinate the MSE separately for both dependant variables, but cross_val_score doesnt let me do that explicitely.
Here is some example code of one of my models:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x_test)
mse = mean_squared_error(scaler2.inverse_transform(y_test), scaler2.inverse_transform(y_pred), multioutput="raw_values")
How can i iterate training on the k times corresponding to the k models trained and tested in a k fold cross validation?
Scikit provides a Kfold but it is just a way to specify the number of folds, and it doesnt actually returns the training and test folds, so i can't think a way to actually train different models using kfold cross validation theory. Plus, i would need to evaluate MSE seprately on each dependant variable since it's a multiple regression problem

You can use Scikit Learn KFold Cross Validation with just a simple for loop.
And here is a example testing 5-fold cross validation on bayes classifer:
from sklearn.model_selection import KFold
k = 5
kf = KFold(n_splits=k)
res = []
for train_index , test_index in kf.split(X_train_concat):
X_train_kf , X_test_kf = X_train_concat[train_index,:],X_train_concat[test_index,:]
y_train_kf , y_test_kf = y_train_concat[train_index] , y_train_concat[test_index]
X_train = np.append(X_train_concat, np.reshape(y_train_concat, (len(y_train_concat),1)), axis=1)
W_bayes = trainBayes(X_train)
y_pred = predict(X_test_kf, W_bayes)
mis_classification = len(y_pred)-np.count_nonzero(y_pred == y_test_kf)
e = (mis_classification / y_test_kf.shape[0]) * 100
res.append(e)
avg_res = sum(res)/k
print('Result of each fold - {}'.format(res))
print('Avg result : {}'.format(avg_res))
For more check this

How do I apply ML model after it has been trained?

I apologies for the naive question, I've trained a model (Naive Bayes) in python , it does well (95% accuracy). It takes an input string (i.e. 'Apple Inc.' or 'John Doe') and discerns whether it's a business name or customer name.
How do I actually implement this on another data set? If I bring in another pandas dataframe, how do I apply what the model has learned from the training data to the new dataframe?
The new dataframe has a completely new population and set of strings that it needs to predict whether its a business or customer name.
Ideally I would like to insert into the new dataframe a column that has the model's prediction.
Any code snippets are appreciated.
Sample code of current model:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df["CUST_NM_CLEAN"],
df["LABEL"],test_size=0.20,
random_state=1)
# Instantiate the CountVectorizer method
count_vector = CountVectorizer()
# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)
# Transform testing data and return the matrix.
testing_data = count_vector.transform(X_test)
#in this case we try multinomial, there are two other methods
from sklearn.naive_bayes import cNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data,y_train)
#MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
predictions = naive_bayes.predict(testing_data)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions, pos_label='Org')))
print('Recall score: {}'.format(recall_score(y_test, predictions, pos_label='Org')))
print('F1 score: {}'.format(f1_score(y_test, predictions, pos_label='Org')))

Figured it out.
# Convert a collection of text documents to a vector of term/token counts.
cnt_vect_for_new_data = count_vector.transform(df['new_data'])
#RUN Prediction
df['NEW_DATA_PREDICTION'] = naive_bayes.predict(cnt_vect_for_new_data)

How to get the predicted probabilities of a classification model?

I'm trying out different classification models using a binary dependent variable (occupied/unoccupied). The models I am interested in are Logistic regression, Decision tree and Gaussian Naïve Bayes.
My input data is a csv-file with a datetime index (e.g. 2019-01-07 14:00), three variable columns ("R", "P", "C", containing numerical values), and the dependent variable column ("value", containing the binary values).
Training the model is not the problem, that all works fine. All the models give me their prediction in binary values (this of course should be the ultimate outcome), but I would also like to see the predicted probabilities which made them decide on either of the binary values. Is there any way to get also these values?
I have tried all of the classification visualizers that function with the yellowbrick package (ClassBalance, ROCAUC, ClassificationReport, ClassPredictionError). But all of these don't give me a graph that shows the calculated probabilities by the model for the data set.
import pandas as pd
import numpy as np
data = pd.read_csv('testrooms_data.csv', parse_dates=['timestamp'])
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
##split dataset into test and trainig set
X = data.drop("value", axis=1) # X contains all the features
y = data["value"] # y contains only the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 1)
###model training
###Logistic Regression###
clf_lr = LogisticRegression()
# fit the dataset into LogisticRegression Classifier
clf_lr.fit(X_train, y_train)
#predict on the unseen data
pred_lr = clf_lr.predict(X_test)
###Decision Tree###
from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier()
pred_dt = clf_dt.fit(X_train, y_train).predict(X_test)
###Bayes###
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)
###visualization for e.g. LogReg
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ROCAUC
#classificationreport
visualizer = ClassificationReport(clf_lr, support=True)
visualizer.fit(X_train, y_train) # Fit the visualizer and the model
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.poof() # Draw/show/poof the data
#classprediction report
visualizer2 = ClassPredictionError(LogisticRegression())
visualizer2.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer2.score(X_test, y_test) # Evaluate the model on the test data
g2 = visualizer2.poof() # Draw visualization
#(ROC)
visualizer3 = ROCAUC(LogisticRegression())
visualizer3.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer3.score(X_test, y_test) # Evaluate the model on the test data
g3 = visualizer3.poof() # Draw/show/poof the data
it would be great to have e.g. an array similar to pred_lr that contains the probabilities calculated for each row of the csv file. Is that possible? If yes, how can I get it?

In most sklearn estimators (if not all) you have a method for obtaining the probability that precluded the classification, either in log probability or probability.
For example, if you have your Naive Bayes classifier and you want to obtain probabilities but not classification itself, you could do (I used same nomenclatures as in your code):
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)
#for probabilities
bayes.predict_proba(X_test)
bayes.predict_log_proba(X_test)
Hope this helps.

Which accuracy score to use for the Mean Decrease Accuracy with the scikit RandomForestClassifier

I've been running the implementation the 'Mean Decrease Accuracy' measure that is shown on this website:
In the example the author is using the random forest regressor RandomForestRegressor, but I am using the random forest classifier RandomForestClassifier. Thus, my question is, if I should also use the r2_score for measuring accuracy or if I should switch to classic accuracy accuracy_score or matthews correlation coefficient matthews_corrcoef?.
Does anybody here if I should switch or not. And why?
Thanks for any help!
Here is the code from the website in case you are too lazy to click :)
from sklearn.cross_validation import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict
X = boston["data"]
Y = boston["target"]
rf = RandomForestRegressor()
scores = defaultdict(list)
#crossvalidate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(len(X), 100, .3):
X_train, X_test = X[train_idx], X[test_idx]
Y_train, Y_test = Y[train_idx], Y[test_idx]
r = rf.fit(X_train, Y_train)
acc = r2_score(Y_test, rf.predict(X_test))
for i in range(X.shape[1]):
X_t = X_test.copy()
np.random.shuffle(X_t[:, i])
shuff_acc = r2_score(Y_test, rf.predict(X_t))
scores[names[i]].append((acc-shuff_acc)/acc)
print "Features sorted by their score:"
print sorted([(round(np.mean(score), 4), feat) for
feat, score in scores.items()], reverse=True)

r2_score is for regression (continuous response variable), whereas classic classification (discrete categorical variable) metrics such like accuracy_score and f1_score roc_auc (the last two are most appropriate if you have unbalanced y-labels) are right choices for your task.
Random shuffling each features in the input data matrix and measuring the decline in these classification metrics sounds like a valid approach to rank feature importances.

Biasing Sklearn toward positives For MultinomialNB

I am trying to run multinomial naive bayes on a series of examples in python using sci kit learn. I am consitently getting all examples classified as negative. The training set is somewhat biased towards negatives P(negative) ~.75. I looked through the documentation and I couldn't find a way to bias toward positives.
from sklearn.datasets import load_svmlight_file
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
X_train, y_train= load_svmlight_file("POS.train")
x_test, y_test = load_svmlight_file("POS.val")
clf = MultinomialNB()
clf.fit(X_train, y_train)
preds = clf.predict(x_test)
print('accuracy: ' + str(accuracy_score(y_test, preds)))
print('precision: ' + str(precision_score(y_test, preds)))
print('recall: ' + str(recall_score(y_test, preds)))

Setting a prior is a poor way to handle this and will result in negative cases being classified as positive that really shouldn't be. Your data has a .25/.75 split, so a .5/.5 prior is a pretty bad option.
Instead, one can average the precision and recall with a harmonic mean to produce an F score which attempts to properly handle biased data like this:
from sklearn.metrics import f1_score
The F1 score can then be used to assess the quality of the model. You can then do some model tuning and cross validation to find a model that better classifies your data i.e. the model that maximizes the F1 score.
Another option is to randomly prune out the negative cases in your data so that the classifier is trained with .5/.5 data. The predict step should then give more appropriate classifications.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to back check classification using sklearn - python

Related

Compute cross validation for multi-variate linear regression

How do I apply ML model after it has been trained?

How to get the predicted probabilities of a classification model?

Which accuracy score to use for the Mean Decrease Accuracy with the scikit RandomForestClassifier

Biasing Sklearn toward positives For MultinomialNB

Categories

Resources