Naive Bayes probability always 1 - python

I started using sklearn.naive_bayes.GaussianNB for text classification, and have been getting fine initial results. I want to use the probability returned by the classifier as a measure of confidence, but the predict_proba() method always returns "1.0" for the chosen class, and "0.0" for all the rest.
I know (from here) that "...the probability outputs from predict_proba are not to be taken too seriously", but to that extent?!
The classifier can mistake finance-investing or chords-strings, but the predict_proba() output shows no sign of hesitation...
A little about the context:
- I've been using sklearn.feature_extraction.text.TfidfVectorizer for feature extraction without, for a start, restricting the vocabulary with stop_words or min/max_df, so I have been getting very large vectors.
- I've been training the classifier on a hierarchical category tree (shallow: not more than 3 layers deep) with 7 texts (manually categorized) per category. It is, for now, flat training: I am not taking the hierarchy into account.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
Can this be related? Are the huge vectors at the root of all this?
How do I get meaningful predictions? Do I need to use a different classifier?
Here's the code I'm using:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.externals import joblib
Vectorizer = TfidfVectorizer(input = 'content')
vecs = Vectorizer.fit_transform(TextsList) # ~2000 strings
joblib.dump(Vectorizer, 'Vectorizer.pkl')
gnb = GaussianNB()
Y = np.array(TargetList) # ~2000 categories
gnb.fit(vecs.toarray(), Y)
joblib.dump(gnb, 'Classifier.pkl')
...
#In a different function:
Vectorizer = joblib.load('Vectorizer.pkl')
Classifier = joblib.load('Classifier.pkl')
InputList = [Text] # One string
Vec = Vectorizer.transform(InputList)
Probs = Classifier.predict_proba([Vec.toarray()[0]])[0]
MaxProb = max(Probs)
MaxProbIndex = np.where(Probs==MaxProb)[0][0]
Category = Classifier.classes_[MaxProbIndex]
result = (Category, MaxProb)
Update:
Following the advice below, I tried MultinomialNB & LogisticRegression. They both return varying probabilities, and are better in every way for my task: much more accurate classification, smaller objects in memory & much better speed (MultinomialNB is lightning fast!).
I now have a new problem: the returned probabilities are very small, typically in the range 0.004-0.012. This is for the predicted/winning category (and the classification is accurate).

"...the probability outputs from predict_proba are not to be taken too seriously"
I'm the guy who wrote that. The point is that naive Bayes tends to predict probabilities that are almost always either very close to zero or very close to one; exactly the behavior you observe. Logistic regression (sklearn.linear_model.LogisticRegression or sklearn.linear_model.SGDClassifier(loss="log"), spelled loss="log_loss" in recent scikit-learn releases) produces more realistic probabilities.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
That's because GaussianNB is a non-linear model and does not support sparse matrices (which you found out already, since you're using toarray). Use MultinomialNB, BernoulliNB or logistic regression, which are much faster at predict time and also smaller. Their assumptions wrt. the input are also more realistic for term features. GaussianNB is really not a good estimator for text classification.
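A minimal sketch of the switch suggested above, assuming a toy corpus (the texts and labels are invented for illustration, they are not the asker's data). It shows MultinomialNB consuming the sparse TF-IDF matrix directly, so no toarray() call is needed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, invented for illustration only
texts = [
    "stocks and bonds for long term investing",
    "bank loans interest rates and finance",
    "guitar chords for beginners",
    "violin strings and tuning",
]
labels = ["finance", "finance", "music", "music"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)      # stays a scipy sparse matrix

clf = MultinomialNB()
clf.fit(X, labels)                       # no .toarray() required

query = vectorizer.transform(["interest rates on guitar loans"])
probs = clf.predict_proba(query)[0]      # one probability per class
best = probs.argmax()
result = (clf.classes_[best], probs[best])
```

With only a handful of classes the winning probability is easy to read; with thousands of categories the probability mass is spread much more thinly.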


Building ML classifier with imbalanced data

I have a dataset with 1400 obs and 19 columns. The Target variable has values 1 (value that I am most interested in) and 0. The distribution of classes shows imbalance (70:30).
Using the code below I am getting weird values (all 1s). I am not figuring out if this is due to a problem of overfitting/imbalance data or to feature selection (I used Pearson correlation since all values are numeric/boolean).
I am thinking that the steps followed are wrong.
import numpy as np
import math
import sklearn.metrics as metrics
from sklearn.metrics import f1_score
y = df['Label']
X = df.drop('Label',axis=1)
def create_cv(X,y):
    if type(X)!=np.ndarray:
        X=X.values
        y=y.values
    test_size=1/5
    proportion_of_true=y[y==1].shape[0]/y.shape[0]
    num_test_samples=math.ceil(y.shape[0]*test_size)
    num_test_true_labels=math.floor(num_test_samples*proportion_of_true)
    num_test_false_labels=math.floor(num_test_samples-num_test_true_labels)
    y_test=np.concatenate([y[y==0][:num_test_false_labels],y[y==1][:num_test_true_labels]])
    y_train=np.concatenate([y[y==0][num_test_false_labels:],y[y==1][num_test_true_labels:]])
    X_test=np.concatenate([X[y==0][:num_test_false_labels] ,X[y==1][:num_test_true_labels]],axis=0)
    X_train=np.concatenate([X[y==0][num_test_false_labels:],X[y==1][num_test_true_labels:]],axis=0)
    return X_train,X_test,y_train,y_test
X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_predict_test = tree.predict(X_test)
print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)
Output:
precision recall f1-score support
0 1.00 1.00 1.00 24
1 1.00 1.00 1.00 70
accuracy 1.00 94
macro avg 1.00 1.00 1.00 94
weighted avg 1.00 1.00 1.00 94
Has anyone experienced similar issues in building a classifier when data has imbalance, using CV and/or under sampling? Happy to share the whole dataset, in case you might want to replicate the output.
What I would like to ask for is a clear answer that shows me the steps to follow and what I am doing wrong.
I know that, to reduce overfitting and work with balance data, there are some methods such as random sampling (over/under), SMOTE, CV. My idea is
Split the data on train/test taking into account imbalance
Perform CV on trains set
Apply undersampling only on a test fold
After the model has been chosen with the help of CV, undersample the train set and train the classifier
Estimate the performance on the untouched test set
(f1-score)
as also outlined in this question: CV and under sampling on a test fold .
I think the steps above should make sense, but happy to receive any feedback that you might have on this.
When you have imbalanced data you have to account for it when splitting (stratification). The usual remedy is to oversample the class that has fewer samples.
Another option is to train your algorithm with less data. If you have a good dataset that should not be a problem. In this case you first take the samples from the less represented class, then use the size of that set to compute how many samples to take from the other class.
This code may help you split your dataset that way:
import random
import pandas as pd

def split_dataset(dataset: pd.DataFrame, train_share=0.8):
    """Splits the dataset into training and test sets"""
    all_idx = range(len(dataset))
    train_count = int(len(all_idx) * train_share)
    train_idx = random.sample(all_idx, train_count)
    test_idx = list(set(all_idx).difference(set(train_idx)))
    train = dataset.iloc[train_idx]
    test = dataset.iloc[test_idx]
    return train, test

def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
    """Splits the dataset as in `split_dataset` but with stratification"""
    data_pos = dataset[dataset[target_attr] == positive_class]
    data_neg = dataset[dataset[target_attr] != positive_class]
    if len(data_pos) < len(data_neg):
        train_pos, test_pos = split_dataset(data_pos, train_share)
        train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
        # set.difference makes the test set larger
        test_neg = test_neg.iloc[0:len(test_pos)]
    else:
        train_neg, test_neg = split_dataset(data_neg, train_share)
        train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
        # set.difference makes the test set larger
        test_pos = test_pos.iloc[0:len(test_neg)]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    train = pd.concat([train_pos, train_neg]).sample(frac = 1).reset_index(drop = True)
    test = pd.concat([test_pos, test_neg]).sample(frac = 1).reset_index(drop = True)
    return train, test
Usage:
train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)
You can now perform cross validation on train_ds and evaluate your model in test_ds.
Another solution works at the model level: use models that support sample weights, such as gradient boosted trees. Of those, CatBoost is usually the best, as its training method leads to less leakage (as described in their article).
Example code:
from catboost import CatBoostClassifier
y = df['Label']
X = df.drop('Label',axis=1)
# scale_pos_weight multiplies the weight of class 1, so use negatives / positives
label_ratio = (y==0).sum() / (y==1).sum()
model = CatBoostClassifier(scale_pos_weight = label_ratio)
model.fit(X, y)
And so forth.
This works because Catboost treats each sample with a weight, so you can determine class weights in advance (scale_pos_weight).
This is better than downsampling, and is technically equal to oversampling (but requires less memory).
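The same weighting idea can be sketched with scikit-learn instead of CatBoost (a swapped-in library, used here only because it is what the question already imports). The labels and features below are synthetic stand-ins for df['Label'] and the remaining columns, which is an assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic 70:30 labels standing in for df['Label'] (assumption)
y = np.array([1] * 70 + [0] * 30)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) + y[:, None]   # features shifted by class

# 'balanced' weights are n_samples / (n_classes * class_count)
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# The minority class (0) gets the larger weight
tree = DecisionTreeClassifier(max_depth=5, class_weight=class_weight,
                              random_state=0)
tree.fit(X, y)
```

Passing class_weight="balanced" directly to the estimator should give the same weights; computing them explicitly just makes the numbers visible.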
Also, a major part of treating imbalanced data is making sure your metrics are weighted as well, or at least well-defined, as you might want equal performance (or deliberately skewed performance) across classes on these metrics.
And if you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure - I'm one of the maintainers):
from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)
- Your implementation of stratified train/test creation is not optimal, as it lacks randomness. Data very often comes in batches, so it is not good practice to take sequences of data as is, without shuffling.
- As @sturgemeister mentioned, a class ratio of 3:7 is not critical, so you should not worry too much about class imbalance. When you artificially change the class balance in training, you will need to compensate for it by multiplying by the prior for some algorithms.
- As for your "perfect" results: either your model is overtrained or it does indeed classify the data perfectly. Use a different train/test split to check this.
- Another point: your test set is only 94 data points. That is definitely not 1/5 of 1400. Check your numbers.
- To get realistic estimates, you need lots of test data. This is why you need to apply a cross-validation strategy.
As for a general strategy for 5-fold CV, I suggest the following:
1. Split your data into 5 folds with respect to the labels (this is called a stratified split; you can use StratifiedKFold for disjoint folds, or StratifiedShuffleSplit for repeated random splits).
2. Take 4 splits and train your model. If you want to use under/oversampling, modify the data in those 4 training splits only.
3. Apply the model to the remaining split. Do not under/oversample the data in the test split. This way you get a realistic performance estimate. Save the results.
4. Repeat 2. and 3. for all test splits (5 times in total). Important: do not change the parameters of the model (e.g. tree depth) between runs; they should be the same for all splits.
5. Now you have all your data points tested without having been trained on. This is the core idea of cross-validation. Concatenate all the saved results and estimate the performance.
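The steps above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (make_classification standing in for the 1400-row dataset), and the decision tree with max_depth=5 is taken from the question rather than being a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 1400 x 18 dataset with a 30:70 class ratio
X, y = make_classification(n_samples=1400, n_features=18,
                           weights=[0.3, 0.7], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = np.empty_like(y)

for train_idx, test_idx in skf.split(X, y):
    # Same hyper-parameters for every fold
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    # Any under/oversampling would be applied to X[train_idx] only
    model.fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = model.predict(X[test_idx])

# Every sample was predicted by a model that never saw it
overall_f1 = f1_score(y, y_pred)
```

Because the folds are disjoint, y_pred ends up with exactly one out-of-fold prediction per sample, which is what makes the concatenated estimate meaningful.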
Cross-validation or held-out set
First of all, you are not doing cross-validation. You are splitting your data in a train/validation/test set, which is good, and often sufficient when the number of training samples is large (say, >2e4). However, when the number of samples is small, which is your case, cross-validation becomes useful.
It is explained in depth in scikit-learn's documentation. You start by taking a test set out of your data, as your create_cv function does. Then you split the rest of the training data into e.g. 3 splits. Then, for i in {1, 2, 3}: train on the splits j != i and evaluate on split i. The documentation explains it with prettier, more colorful figures; you should have a look! It can be quite cumbersome to implement, but fortunately scikit-learn does it out of the box.
As for the dataset being unbalanced, it is a very good idea to keep the same ratio of labels in each set. But again, you can let scikit handle it for you!
Purpose
Also, the purpose of cross-validation is to choose the right values for the hyper-parameters. You want the right amount of regularization, neither too much (under-fitting) nor too little (over-fitting). If you're using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the hyper-parameter that controls the regularization of your method.
Conclusion
Simply use GridSearchCV. You will have cross-validation and label balance done for you.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
# stratify=y keeps the label ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratify=y)
tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
clf = GridSearchCV(tree, parameters, cv=5) # For a classifier, an integer cv uses StratifiedKFold, see documentation
clf.fit(X_train, y_train)
sorted(clf.cv_results_.keys())
You can also replace the cv variable by a fancier shuffler, such as StratifiedGroupKFold (no intersection between groups).
I would also advise looking at random forests, which are less interpretable but are said to perform better in practice.
Just wanted to add thresholding and cost-sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists of finding a new threshold for separating the positive and negative classes (it is generally 0.5, but it can be treated as a hyper-parameter). The latter consists of weighting the classes to cope with their imbalance. This article was really useful to me for understanding how to deal with unbalanced data sets. It also covers cost-sensitive learning, with a specific explanation that uses a decision tree as the model. All the other approaches are nicely reviewed as well, including Adaptive Synthetic Sampling, informed undersampling, etc.
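Both ideas can be sketched together. The data below is synthetic and the logistic regression is just a convenient model with predict_proba; neither comes from the linked articles, so treat this as an illustration rather than their method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption), split with stratification
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Cost-sensitive learning: class_weight='balanced' penalises errors on the
# minority class more heavily
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Thresholding: scan cut-offs on validation data instead of assuming 0.5
proba_pos = model.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, (proba_pos >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
```

The threshold should be picked on validation data (or inside cross-validation), not on the final test set, otherwise the reported score is optimistic.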

sklearn MultinomialNB only predicts class priors

I am currently trying to roll my own naive Bayes classifier for categorical features to make sure I understand it. Now I want to compare it with sklearn's MultinomialNB. But for some reason, I can't get the sklearn version to run correctly.
The easiest thing to compare on, I thought, was the Kaggle Titanic dataset. So I did this (which is fairly simple, right?):
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
train = pd.read_csv('data/in/train.csv')
X = np.asarray(train[['Pclass']])
y = np.asarray(train['Survived'])
clf = MultinomialNB()
clf.fit(X, y)
clf.predict_proba(X)
But what it actually predicts (or not, in this case...) is that everyone on the Titanic dies. In other words, when the class labels to predict are [0, 1], it predicts 0. The weirdest thing is that it apparently just outputs the class prior P(y) (which I checked with my home-brewed algorithm ;)) for every prediction. So it apparently doesn't multiply it by the likelihood P(X|y).
Has anyone ever encountered this? Am I making some apparent mistake here?
Edit:
I think I got it now. If I transform the input dataset into a contingency table, and one-hot encode the input feature, it gives the same predicted probabilities. I used a smoothing of alpha=0 for comparison with my own algorithm:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
train = pd.read_csv('data/in/train.csv')
X_test = np.asarray(pd.get_dummies(train['Pclass']))
X = np.array(pd.crosstab(train[y_column], train['Pclass']))
y = np.array([0,1])
clf = MultinomialNB(alpha=0.0000000001, class_prior=np.array(class_prior))
clf.fit(X, y)
clf.predict_proba(X_test)
Still, one thing I wonder about is why I had to specify the class prior manually now. If I hadn't done that, sklearn would have used an uninformed prior, [0.5, 0.5]...
The Multinomial Naive Bayes model is working exactly as it is supposed to work given a single feature. If you take a look at the formula for P(X|y), it is equal to 1 when the number of features n=1. Here is why.
The Naive Bayes models differ from each other by assumptions they make about the conditional distribution P(X | y). The Multinomial Naive Bayes assumes this is a multinomial distribution. The multinomial distribution models the probability of counts for rolling a (possibly biased) k-sided die n times.
Suppose, for example, that you are given a batch of dice produced by two companies: FairDice and Crooks&Co. FairDice is known to produce fair dice, and Crooks&Co produces loaded dice that overwhelmingly fall with 6 on top. You are asked to learn to predict a die's producer from throwing it several times and looking at the results. You throw each die several times and record the results in a dataset with 6 features. Each feature represents how many times the corresponding value occurred when throwing a die.
count_1  count_2  count_3  count_4  count_5  count_6  fair_dice
5        6        4        7        6        5        1
3        2        1        2        1        13       0
Now this is an appropriate dataset for training a Multinomial Naive Bayes classifier.
Training a Multinomial Naive Bayes classifier on a single feature is equivalent to trying to classify one-sided dice with the same number on top.
A one-sided die. They exist!
E.g. if your feature has values [3,2,1], it means that you threw the first die three times and got 1 every time, threw the second die twice and got 1 both times, and threw the third die once and got 1. This gives you no information about the dice producer, so the best you can predict is the class prior, which is exactly what the algorithm does.
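A small check of this argument, assuming a made-up ten-row stand-in for the Titanic columns (the values are invented, not taken from the Kaggle file). With a single count feature the predicted probabilities collapse to the class prior, while one-hot encoding the same column gives the model a three-sided die to work with:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in for the Titanic columns: one feature (Pclass) and a label
pclass = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3])
survived = np.array([1, 1, 0, 1, 0, 0, 0, 0, 1, 0])

# Single count feature: P(x | y) carries no information about the class
single = MultinomialNB().fit(pclass.reshape(-1, 1), survived)
proba_single = single.predict_proba(pclass.reshape(-1, 1))
prior = np.bincount(survived) / len(survived)

# One-hot encoding turns Pclass into a 3-sided die thrown once per passenger
one_hot = (pclass[:, None] == np.array([1, 2, 3])).astype(int)
encoded = MultinomialNB().fit(one_hot, survived)
proba_encoded = encoded.predict_proba(one_hot)
```

Every row of proba_single equals prior, whereas the rows of proba_encoded vary with the passenger class.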

Multiclass classification using Gaussian Mixture Models with scikit learn

I am trying to use sklearn.mixture.GaussianMixture to classify pixels in a hyper-spectral image. There are 15 classes (1-15). I tried the method from http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html. There, the means are initialized with means_init. I tried this as well, but my accuracy is poor (about 10%). I also tried changing the covariance type, the threshold, the maximum number of iterations and the number of initializations, but the results are the same.
Am I doing this correctly? Please provide inputs.
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.io as sio
from sklearn.model_selection import train_test_split
uh_data =sio.loadmat('/Net/hico/data/users/nikhil/contest_uh_casi.mat')
data = uh_data['contest_uh_casi']
uh_labels = sio.loadmat('/Net/hico/data/users/nikhil/contest_gt_tr.mat')
labels = uh_labels['contest_gt_tr']
reshaped_data = np.reshape(data,(data.shape[0]*data.shape[1],data.shape[2]))
print 'reshaped data :',reshaped_data.shape
reshaped_label = np.reshape(labels,(labels.shape[0]*labels.shape[1],-1))
print 'reshaped label :',reshaped_label.shape
con_data = np.hstack((reshaped_data,reshaped_label))
pre_data = con_data[con_data[:,144] > 0]
total_data = pre_data[:,0:144]
total_label = pre_data[:,144]
train_data, test_data, train_label, test_label = train_test_split(total_data, total_label, test_size=0.30, random_state=42)
classifier = GaussianMixture(n_components = 15 ,covariance_type='diag',max_iter=100,random_state = 42,tol=0.1,n_init = 1)
classifier.means_init = np.array([train_data[train_label == i].mean(axis=0)
for i in range(1,16)])
classifier.fit(train_data)
pred_lab_train = classifier.predict(train_data)
train_accuracy = np.mean(pred_lab_train.ravel() == train_label.ravel())*100
print 'train accuracy:',train_accuracy
pred_lab_test = classifier.predict(test_data)
test_accuracy = np.mean(pred_lab_test.ravel()==test_label.ravel())*100
print 'test accuracy:',test_accuracy
My data has 66485 pixels and 144 features each. I also tried to do after applying some feature reduction techniques like PCA, LDA, KPCA etc, but the results are still the same.
Gaussian Mixture is not a classifier. It is a density estimation method, and expecting that its components will magically align with your classes is not a good idea. You should try out actual supervised techniques, since you clearly do have access to labels. Scikit-learn offers lots of these, including Random Forest, KNN, SVM, ... pick your favourite. GMM simply tries to fit mixture of Gaussians into your data, but there is nothing forcing it to place them according to the labeling (which is not even provided in the fit call). From time to time this will work - but only for trivial problems, where classes are so well separated that even Naive Bayes would work, in general however it is simply invalid tool for the problem.
GMM is not a classifier but a generative model. You can apply it to a classification problem by using Bayes' theorem. It's not true that classification based on GMM works only for trivial problems. However, it is based on a mixture of Gaussian components, so it best fits problems with high-level features.
Your code uses GMM incorrectly as a classifier. You should fit one GMM per class as the class-conditional density, then combine it with the class prior to obtain the posterior.
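A minimal sketch of the one-GMM-per-class approach, assuming synthetic, well-separated blobs in place of the hyper-spectral pixels. The choice of 2 components per class is arbitrary and would normally be tuned:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Synthetic, well-separated data standing in for the labelled pixels
centers = [[0, 0], [10, 10], [-10, 10]]
X, y = make_blobs(n_samples=600, centers=centers, cluster_std=1.0,
                  random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

classes = np.unique(y_train)
models, log_priors = [], []
for c in classes:
    X_c = X_train[y_train == c]
    # One mixture per class models p(x | class); 2 components is arbitrary
    gmm = GaussianMixture(n_components=2, covariance_type="diag",
                          random_state=42)
    models.append(gmm.fit(X_c))
    log_priors.append(np.log(len(X_c) / len(X_train)))

# Bayes' theorem in log space:
# log p(class | x) = log p(x | class) + log p(class) + const
log_joint = np.column_stack(
    [m.score_samples(X_test) + lp for m, lp in zip(models, log_priors)])
y_pred = classes[np.argmax(log_joint, axis=1)]
accuracy = np.mean(y_pred == y_test)
```

Unlike fitting a single 15-component mixture on unlabelled data, each mixture here only ever sees samples of its own class, so the labels are actually used.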

Scikit-learn's TfidfTransformer makes my pipeline predict just one label

I have a pandas dataframe containing texts and labels, and I'm trying to predict the labels using scikit-learn's CountVectorizer, TfidfTransformer and MultinomialNB. Here's what the dataframe looks like:
text party
0 Herr ålderspresident! Att vara talman i Sverig... S
1 Herr ålderspresident! Ärade ledamöter av Sveri... M
2 Herr ålderspresident! Som företrädare för Alli... M
3 Val av andre vice talman Herr ålderspresident!... SD
4 Herr ålderspresident! Vänsterpartiet vill utny... V
When I construct a pipeline with the three estimators mentioned above, I only get a ~35% accuracy in my predictions, but when I remove the TfidfTransformer the accuracy is bumped up to a more reasonable ~75% accuracy.
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()), # problematic row
('clf', MultinomialNB()),
])
text_clf = text_clf.fit(df.text.values, df.party.values)
test = df.sample(500, random_state=42)
docs_test = test.text.values
predicted = text_clf.predict(docs_test)
np.mean(predicted == test.party.values)
# Out: either 0.35 or 0.75 depending on whether I comment out the problematic row above
When I get 0.35 and inspect predicted I see that it almost exclusively contains one label ('S'). This is the most common label in the original dataset, but that shouldn't impact the predictions, right? Any ideas on why I get these strange results?
EDIT: Link to data where anforandetext and parti are the relevant columns.
The reason you are getting such a big difference is smoothing. If you check the documentation of the MultinomialNB class, look at the alpha parameter. Its default value is 1.0, which means that plus-one smoothing is applied by default. Plus-one smoothing is a very common technique used with relative frequency estimates to account for unseen data: we add 1 to all raw counts to account for unseen terms and the sparsity of the document-term matrix.
However, when you use TF-IDF weights, the numbers that you get are very small, mostly between 0 and 1. To illustrate, if I take your data and only convert it into TF-IDF weights, this is a small snapshot of the weights that I obtain.
(0, 80914) 0.0698184481033
(0, 80552) 0.0304609466459
(0, 80288) 0.0301759343786
(0, 80224) 0.103630302925
(0, 80204) 0.0437500703747
(0, 80192) 0.0808649191625
You can see that these are really small numbers, and adding 1 to them for smoothing has a drastic effect on the calculations that Multinomial Naive Bayes makes. By adding 1 to these numbers, you completely change their scale, and hence your estimates get distorted. I am assuming you have a good idea of how Multinomial Naive Bayes works. If not, then definitely see this video. The video and my answer should be sufficient to understand what is going wrong here.
You should either use a small value of alpha in the TF-IDF case, or build the TF-IDF weights after smoothing the raw counts. As a secondary note, please use cross-validation to get any accuracy estimates. By testing the model on a sample of the training data, your accuracy numbers will be extremely biased. I would recommend using cross-validation or a separate hold-out set to evaluate your model.
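Both recommendations can be sketched in a few lines. The corpus below is invented and stands in for df.text and df.party, and alpha=0.01 is an arbitrary small value that would normally be chosen by a grid search:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration; it stands in for df.text / df.party
texts = ["lower taxes for business", "cut taxes and regulation",
         "tax relief for companies", "free market and lower taxes",
         "business friendly tax policy", "less regulation for firms",
         "more funding for schools", "invest in public healthcare",
         "stronger welfare for workers", "public schools need funding",
         "healthcare for all workers", "welfare and public services"]
labels = ["M"] * 6 + ["S"] * 6

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB(alpha=0.01)),   # 0.01 is an arbitrary small value
])

# Each fold is scored on documents the model was not trained on
scores = cross_val_score(text_clf, texts, labels, cv=3)
mean_accuracy = scores.mean()
```

Because the vectorizer and the TF-IDF step live inside the pipeline, they are refitted on the training part of every fold, so no information leaks from the held-out documents.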
Hope that helps.

Classifying text documents with random forests

I've a set of 4k text documents.
They belong to 10 different classes.
I'm trying to see how random forest method performs classification.
The issue is my feature extraction class extracts 200k features.(A combination of words,bigrams,collocations etc.)
This is highly sparse data and random forest implementation in sklearn does not work with sparse data inputs.
Q. What are my options here? Reduce number of features ? How ?
Q. Is there any implementation of random forest out there which work with sparse array.
My relevant code is as follows:
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
#import pylab as pl
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from special_analyzer import *
data_train = load_files(RAW_DATA_SRC_TR)
data_test = load_files(RAW_DATA_SRC_TS)
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target
vectorizer = CountVectorizer( analyzer=SpecialAnalyzer()) # SpecialAnalyzer is my class extracting features from text
X_train = vectorizer.fit_transform(data_train.data)
rf = RandomForestClassifier(max_depth=10,max_features=10)
rf.fit(X_train,y_train)
Several options: take only the 10000 most frequent features by passing max_features=10000 to CountVectorizer, and convert the result to a dense numpy array with the toarray method:
X_train_array = X_train.toarray()
Otherwise reduce the dimensionality to 100 or 300 dimensions with:
pca = TruncatedSVD(n_components=300)
X_reduced_train = pca.fit_transform(X_train)
However in my experience I could never make a RF work better than a well tuned linear model (such as logistic regression with grid searched regularization parameter) on the original sparse data (possibly with TF-IDF normalization).
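A sketch of the dimensionality-reduction route, assuming a random sparse matrix in place of the real document-term matrix (the shapes, density and labels are made up). Recent scikit-learn releases also accept sparse input in RandomForestClassifier directly, so the reduction is a modelling choice rather than a hard requirement there:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Random sparse matrix standing in for the document-term matrix (assumption)
rng = np.random.default_rng(0)
X_train = sparse.random(200, 5000, density=0.01, format="csr", random_state=0)
y_train = rng.integers(0, 10, size=200)   # 10 made-up classes

# TruncatedSVD works on sparse input and returns a dense, low-dimensional array
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X_train)

# The same two steps chained, so the reduction is refitted with the model
model = make_pipeline(
    TruncatedSVD(n_components=50, random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)
```

Wrapping the steps in a pipeline means the same projection is applied automatically when calling model.predict on new documents.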
Option 1:
"If the number of variables is very large, forests can be run once with all the variables, then run again using only the most important variables from the first run."
from: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp
I'm not sure whether the random forest in sklearn has a feature importance option. The random forest in R implements mean decrease in Gini impurity as well as mean decrease in accuracy.
Option 2:
Do dimensionality reduction. Use PCA or another dimension reduction technique to turn the sparse matrix of N dimensions into a smaller matrix, and then use this smaller, less sparse matrix for the classification problem.
Option 3:
Drop correlated features. I believe the random forest is supposed to be more robust to correlated features than multinomial logistic regression. That being said, it could be the case that you have a number of correlated features. If you have a lot of pairwise correlated variables, you can drop one of each pair and you should, in theory, not lose "predictive power". In addition to pairwise correlation there is also multiple correlation. Check out: http://en.wikipedia.org/wiki/Variance_inflation_factor
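Option 1 above can be sketched in scikit-learn, whose fitted forests expose impurity-based importances through the feature_importances_ attribute. The data is synthetic and the cut-off of 20 features is arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 200 features, of which only 10 are informative (assumption)
X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           random_state=0)

# First pass: fit on all variables and rank them by impurity-based importance
first = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_k = 20   # arbitrary cut-off
keep = np.argsort(first.feature_importances_)[::-1][:top_k]

# Second pass: refit using only the highest-ranked variables
second = RandomForestClassifier(n_estimators=200, random_state=0)
second.fit(X[:, keep], y)
```

To keep the evaluation honest, the ranking should be computed on training data only, for example inside each cross-validation fold.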
