I have a movie review dataset and I want to perform sentiment analysis on it.
I have implemented this using logistic regression. Following are the steps that I took in the process:
Removed stop words and punctuation from each row in the dataset.
Split the data into train, validation and test set.
Created a vocabulary of words from the training set.
Added every word in the vocabulary as a feature. If this word is in the current row, its TF-IDF value is set as the value of the feature, else 0 is set as the value.
Trained the model. During training, the sigmoid function is used to compute the hypothesis and cross-entropy loss is used as the cost function; the weights of the model are then updated using gradient descent.
Tuned hyperparameters using the validation set.
Evaluated the model using the test set.
Now, I need to implement the same thing using Naive Bayes and I'm confused as to how to approach this problem. I assume the first 4 steps are going to be the same. But what is the training step when using Naive Bayes? What is the loss function and cost function in this case? And where do I use the Bayes' theorem to calculate the conditional probability? And how do I update the weights and biases?
I've searched a lot of resources on the web and I've mostly only found implementations using sklearn with model.fit and model.predict and I'm having a hard time figuring out the math behind this and how it could be implemented using vanilla python.
In the case of logistic regression or SVM, the model is trying to find the hyperplane that best separates the data, and so these models learn weights and biases.
Naive Bayes, on the other hand, is a probabilistic approach; it relies entirely on Bayes' theorem.
There will be NO weights and biases in NB, there will only be CLASS-WISE probability values for each of the features (i.e., words in the case of text).
To avoid zero probabilities or to handle the case of unseen data (words in case of text), use Laplace Smoothing.
α is called the smoothing factor, and it is a hyperparameter in NB.
Use logs for numerical stability.
Test example: This movie is great
After removing the stopwords: movie great
From the training data, we already know the probability values for the words movie and great for both the +ve and -ve classes. Refer STEP 2.
The probability of great for the +ve class would be greater than the probability of great for the -ve class. For the word movie, the probability values could be almost the same. (This highly depends on your training data; here I am just making an assumption.)
positive class prob ∝ P(+ve) * P(movie | +ve) * P(great | +ve)
negative class prob ∝ P(-ve) * P(movie | -ve) * P(great | -ve)
(P(+ve) and P(-ve) are the class priors estimated from the training data.) Compare the two class probability values and return the class with the higher value.
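If you want to implement this in vanilla Python rather than with sklearn, a minimal sketch of the training and prediction steps could look like the following (all names are hypothetical; docs is a list of token lists, labels the corresponding class labels, and alpha the Laplace smoothing hyperparameter):

import math
from collections import defaultdict

def train_nb(docs, labels, alpha=1.0):
    class_counts = defaultdict(int)                       # number of documents per class
    word_counts = defaultdict(lambda: defaultdict(int))   # word_counts[c][w]: count of word w in class c
    total_words = defaultdict(int)                        # total word count per class
    vocab = set()
    for tokens, c in zip(docs, labels):
        class_counts[c] += 1
        for w in tokens:
            word_counts[c][w] += 1
            total_words[c] += 1
            vocab.add(w)
    priors = {c: class_counts[c] / len(docs) for c in class_counts}
    return priors, word_counts, total_words, vocab, alpha

def predict_nb(tokens, priors, word_counts, total_words, vocab, alpha):
    best_class, best_logprob = None, float('-inf')
    for c in priors:
        # log prior + sum of log likelihoods with Laplace smoothing
        logprob = math.log(priors[c])
        for w in tokens:
            logprob += math.log((word_counts[c][w] + alpha) /
                                (total_words[c] + alpha * len(vocab)))
        if logprob > best_logprob:
            best_class, best_logprob = c, logprob
    return best_class

Calling predict_nb("movie great".split(), *train_nb(train_docs, train_labels)) then returns the class with the higher log probability. There is no gradient descent and no weight updates, only these counted probabilities, and α is the hyperparameter you tune on the validation set.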
P.S.
If the number of words in the sentence is large, the class probability value becomes very, very small. Using logs solves this problem.
If the word great wasn't in the training set, the class probability value would be 0, so use the smoothing factor α (Laplace smoothing).
Refer to the scikit-learn naive Bayes documentation for more detailed info.
I'm struggling to find a learning algorithm that works for my dataset.
I am working with a typical regression problem. There are 6 features in the dataset that I am concerned with, and about 800 data points. The features and the predicted values have a highly non-linear correlation, so the features are not useless (as far as I understand). The predicted values have a bimodal distribution, so I disregarded linear models pretty quickly.
So I have tried 5 different models: random forest, extra trees, AdaBoost, gradient boosting and the xgb regressor. The training dataset gives about 64% accuracy and the test data returns 11%-14%. Both numbers scare me haha. I tried tuning the parameters for the random forest, but nothing seems to make a drastic difference.
Function to tune the parameters
from sklearn.model_selection import GridSearchCV

def hyperparatuning(model, train_features, train_labels, param_grid={}):
    # exhaustive grid search with 3-fold cross-validation
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)
    return grid_search.best_estimator_
Function to evaluate the model
import numpy as np

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)   # mean absolute percentage error
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
I expected the output to be at least, ya know, acceptable, but instead I got 64% on the training data and 12-14% on the testing data. It is a real horror to look at these numbers!
There are several issues with your question.
For starters, you are trying to use accuracy in what seems to be a regression problem, which is meaningless.
Although you don't provide the exact models (it would arguably be a good idea), this line in your evaluation function
errors = abs(predictions - test_labels)
is actually the basis of the mean absolute error (MAE - although you should actually take its mean, as the name implies). MAE, like MAPE, is indeed a performance metric for regression problems; but the formula you use next
accuracy = 100 - mape
does not actually hold, nor is it used in practice.
It is true that, intuitively, one might want to get the 1-MAPE quantity; but this is not a good idea, as MAPE itself has a lot of drawbacks which seriously limit its use; here is a partial list from Wikipedia:
It cannot be used if there are zero values (which sometimes happens for example in demand data) because there would be a division by zero.
For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
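Instead, it is more meaningful to report standard regression metrics directly. A minimal sketch using sklearn.metrics, reusing the predictions and test_labels from your evaluate function:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(test_labels, predictions)
rmse = np.sqrt(mean_squared_error(test_labels, predictions))
r2 = r2_score(test_labels, predictions)
print('MAE: {:0.4f}, RMSE: {:0.4f}, R^2: {:0.3f}'.format(mae, rmse, r2))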
It is an overfitting problem. You are fitting the hypothesis very well on your training data.
Possible solutions to your problem:
You can try getting more training data (not more features).
Try a less complex model, like a decision tree, since highly complex models (random forests, neural networks, etc.) fit the hypothesis very well on the training data.
Cross-validation: it allows you to tune hyperparameters with only your original training set, and lets you keep your test set as a truly unseen dataset for selecting your final model.
Regularization: the method will depend on the type of learner you're using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression.
I would suggest you use sklearn's Pipeline, since it lets you chain preprocessing and modelling steps and tune them together.
An example of that:
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import numpy as np

pca = PCA()
logistic = SGDClassifier(loss='log', penalty='l2')   # logistic regression trained with SGD
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    'pca__n_components': [5, 20, 30, 40, 50, 64],
    'logistic__alpha': np.logspace(-4, 4, 5),
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
I would suggest improving the preprocessing of your data. Try to manually remove the outliers, and look up the concept of Cook's distance to find elements that have a disproportionately negative influence on your model. Also, you could scale the data differently than standard scaling: use log scaling if elements in your data are too big or too small, or use feature transformations like the DCT or SVD transform.
Or, at the simplest, you could create your own features from the existing data. For example, if you have yesterday's closing price and today's opening price as 2 features in stock price prediction, you can create a new feature for the percentage difference, which could help your accuracy a lot.
Do some linear regression analysis to know the beta values, to get a better understanding of which features contribute most to the target value. You can also use feature_importances_ in random forests for the same purpose, and try to improve those features as much as possible so that the model understands them better.
This is just the tip of the iceberg of what could be done. I hope this helps.
Currently, you are overfitting so what you are looking for is regularization. For example, to reduce the capacity of models that are ensembles of trees, you can limit the maximum depth of the trees (max_depth), increase the minimum required samples at a node to split (min_samples_split), reduce the number of learners (n_estimators), etc.
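As a concrete illustration (a sketch only; the specific values are just illustrative starting points), such a grid could be plugged into the hyperparatuning function from the question:

from sklearn.ensemble import RandomForestRegressor

param_grid = {
    'max_depth': [3, 5, 8],             # shallower trees reduce capacity
    'min_samples_split': [5, 10, 20],   # require more samples before splitting a node
    'n_estimators': [50, 100, 200],     # size of the ensemble
}
best_rf = hyperparatuning(RandomForestRegressor(random_state=0),
                          train_features, train_labels, param_grid)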
When performing cross-validation, you should fit on the training set and evaluate on your validation set and the best configuration should be the one that performs the best on the validation set. You should also keep a test set in order to evaluate your model on completely new observations.
While implementing a linear regression model on a bag of words, Python returned very large/small coefficient values. train_data_features contains all the words in the training data. The training data consists of about 400 comments, each under 500 characters, with a rating between 0 and 5. Afterwards, I created a bag of words for each document. While trying to perform a linear regression on the matrix of all bags of words,
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(train_data_features, train['dim_hate'])
coef = clf.coef_
words = vectorizer.get_feature_names()
for i in range(len(words)):
    print(str(words[i]) + " " + str(coef[i]))
the results seem very strange (here is just an example of 3 out of 4000). They show the coefficients of the fitted regression function for the words.
btw -0.297473967075
land 54662731702.0
landesrekord -483965045.253
I'm very confused because the target variable is between 0 and 5, but the coefficients are so different. Most of them are very large/small numbers, and I was expecting only values like the one for btw.
Do you have an idea, why the results are like they are?
It might be that your model is overfitting to the data, since it's trying to exactly match the outputs. You're right to be worried and suspicious, because it means that your model is probably overfitting to your data and will not generalize well to new data. You can try one of two things:
Run LinearRegression(normalize=True) and see if it helps with the coefficients. But it will only be a temporary solution.
Use Ridge regression instead. It is basically doing Linear Regression, except adding a penalty for having coefficients that are too large.
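A minimal sketch of the Ridge variant, reusing the train_data_features and target column from your snippet (the alpha value is just a placeholder to tune):

from sklearn.linear_model import Ridge

clf = Ridge(alpha=1.0)   # larger alpha = stronger penalty on large coefficients
clf.fit(train_data_features, train['dim_hate'])
coef = clf.coef_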
Check for correlated features in your data-set.
You may run into this problem if your features are highly correlated. For example, expenses per customer:
jan_expenses, feb_expenses, mar_expenses, Q1_expenses
the Q1 feature is the sum of jan-mar, and therefore your coefficients will go 'crazy' when fitting, as the model struggles to find a line that best describes both the monthly features and the Q1 feature. Try removing the highly correlated features and re-run.
(btw Ridge regression also solved the problem for me, but I was curious as to why this happens, so I dug in a bit)
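A quick way to spot such near-duplicate features, assuming the features fit in a pandas DataFrame (X and feature_names are hypothetical here), is to inspect the pairwise correlation matrix:

import numpy as np
import pandas as pd

df = pd.DataFrame(X, columns=feature_names)            # hypothetical: features as a DataFrame
corr = df.corr().abs()                                 # absolute pairwise feature correlations
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # look at each pair only once
pairs = corr.where(mask).stack()
print(pairs[pairs > 0.95])                             # pairs that are almost perfectly correlated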
I have texts that are rated on a continuous scale from -100 to +100. I am trying to classify them as positive or negative.
How can you perform binomial log regression to get the probability that test data is -100 or +100?
The closest I have got is the SGDClassifier( penalty='l2',alpha=1e-05, n_iter=10), but this doesn't provide the same results as SPSS when I use binomial log regression to predict the probability of -100 and +100. So I'm guessing this is not the right function?
SGDClassifier provides access to several linear classifiers, all trained with stochastic gradient descent. It defaults to a linear support vector machine unless you call it with a different loss function; loss='log' will give you a probabilistic logistic regression.
See the documentation at:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
Alternatively, you could use sklearn.linear_model.LogisticRegression to classify your texts with a logistic regression.
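For example, a small sketch of both options (X, y, and X_test are hypothetical feature matrices and binary labels built from your texts):

from sklearn.linear_model import SGDClassifier, LogisticRegression

# Logistic regression trained with SGD; without loss='log' you get a linear SVM
sgd_lr = SGDClassifier(loss='log', penalty='l2', alpha=1e-5)
sgd_lr.fit(X, y)

# Plain (batch) logistic regression
lr = LogisticRegression()
lr.fit(X, y)

probs = lr.predict_proba(X_test)   # per-class probabilities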
It's not clear to me that you will get exactly the same results as you do with SPSS due to differences in implementation. However, I would not expect to see statistically significant differences.
Edited to add:
My suspicion is that the 99% accuracy you're getting with the SPSS logistic regression is training set accuracy, while the 87% that you're seeing with the scikit-learn logistic regression is test set accuracy. I found this question on the Data Science Stack Exchange where a different person is attempting an extremely similar problem and getting ~99% accuracy on training sets and 90% test set accuracy.
https://datascience.stackexchange.com/questions/987/text-categorization-combining-different-kind-of-features
My recommended path forward is as follows: try several different basic classifiers in scikit-learn, including the standard logistic regression and a linear SVM, and also rerun the SPSS logistic regression several times with different train/test subsets of your data and compare the results. If you continue to see a large divergence across classifiers that can't be accounted for by ensuring similar train/test data splits, then post the results that you're seeing into your question, and we can move forward from there.
Good luck!
If pos/neg, or the probability of pos, is really the only thing you need as output, then you can derive binary labels y as
y = score > 0
assuming you have the scores in a NumPy array score.
You can then feed this to a LogisticRegression instance, using the continuous score to derive relative weights for the samples:
from sklearn.linear_model import LogisticRegression
import numpy as np

clf = LogisticRegression()
sample_weight = np.abs(score)           # weight each sample by the magnitude of its score
sample_weight /= sample_weight.sum()
clf.fit(X, y, sample_weight=sample_weight)
This gives maximum weight to tweets with scores ±100, and a weight of zero to tweets that are labeled neutral, varying linearly between the two.
If the dataset is very large, then as @brentlance showed, you can use SGDClassifier, but you have to give it loss="log" if you want a logistic regression model; otherwise, you'll get a linear SVM.
I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics - classifying protein sequences as being Neuropeptide precursors sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor) .
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc. on the training set with 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (linear kernel), SVM (RBF kernel) and GRB) and a simple majority vote.
My question is:
How can I get the performance metrics for the different classifiers and/or their votes predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results and to get it from a combination of classifiers. (That is, to see what the ROC curve is just for each classifier alone [already known], then to see the ROC curve or AUC for the training data using combinations of classifiers).
(I currently filter the predictions using "predict probabilities" with the Random Forests and ExtraTrees methods, then I filter arbitrarily for results with a predicted score below '0.85'. An additional layer of filtering is "how many classifiers agree on this protein's positive classification").
Thank you very much!!
(The website implementation, where we're using the multiple classifiers: http://neuropid.cs.huji.ac.il/. The whole shebang is implemented using scikit-learn and Python. Citations and all!)
To evaluate the performance of the ensemble, simply follow the same approach as you would normally. However, you will want to get the 10-fold dataset partitions first; for each fold, train all members of your ensemble on that fold's training data, measure the accuracy of the combined prediction, then rinse and repeat with the other folds and compute the overall accuracy of the ensemble. So the key difference is to not train the individual algorithms using k-fold cross-validation when evaluating the ensemble. The important thing is not to let the ensemble see the test data, either directly or by letting one of its algorithms see the test data.
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
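A rough sketch of that fold-wise evaluation for a majority-vote ensemble (all names hypothetical; X is the feature matrix, y the numeric class labels, and classifiers a list of unfitted scikit-learn estimators):

import numpy as np
from scipy.stats import mode
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def ensemble_cv_accuracy(classifiers, X, y, n_splits=10):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, test_idx in skf.split(X, y):
        fold_preds = []
        for clf in classifiers:
            # every member is retrained from scratch on this fold's training data only
            fitted = clone(clf).fit(X[train_idx], y[train_idx])
            fold_preds.append(fitted.predict(X[test_idx]))
        majority = mode(np.vstack(fold_preds), axis=0)[0].ravel()   # simple majority vote
        fold_scores.append(accuracy_score(y[test_idx], majority))
    return np.mean(fold_scores)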
An alternative approach (again taking care that the ensemble never sees the test data) is to take the probabilities and/or labels output by your classifiers and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "stacking".
You can use a linear regression for stacking. For each of the 10 folds, you can split the data into:
8 training sets
1 validation set
1 test set
Optimise the hyper-parameters for each algorithm using the training and validation sets, then stack your predictions by fitting a linear regression (or a logistic regression) over the validation set. Your final model will be p = a_0 + a_1 p_1 + … + a_K p_K, where K is the number of classifiers, p_k is the probability given by model k, and a_k is the weight of model k. You can also directly use the predicted outcomes if a model doesn't give you probabilities.
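For example, a small sketch of that blending step, where val_probs and test_probs are hypothetical arrays with one column of predicted probabilities per base classifier and y_val holds the validation labels:

from sklearn.linear_model import LogisticRegression

# Fit the blender on the base models' validation-set probabilities
blender = LogisticRegression()
blender.fit(val_probs, y_val)

# Apply the same blending to the base models' test-set probabilities
final_probability = blender.predict_proba(test_probs)[:, 1]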
If your models are all of the same kind, you can optimise the parameters of the models and the weights at the same time.
If you have obvious differences, you can use different bins with different parameters for each. For example, one bin could be short sequences and the other long sequences, or different types of proteins.
You can use whatever metric you want, as long as it makes sense, just as for non-blended algorithms.
You may want to look at the 2007 BellKor solution to the Netflix challenge, section Blending. In 2008 and 2009 they used more advanced techniques; those may also be interesting for you.
I'm doing a project on document classification using a naive Bayes classifier in Python. I have used the nltk Python module for this. The docs are from the Reuters dataset. I performed preprocessing steps such as stemming and stopword elimination and proceeded to compute the TF-IDF of the index terms. I used these values to train the classifier, but the accuracy is very poor (53%). What should I do to improve the accuracy?
A few points that might help:
Don't use a stoplist, it lowers accuracy (but do remove punctuation);
Look at word features, and take only the top 1000, for example; reducing dimensionality will improve your accuracy a lot;
Use bigrams as well as unigrams; this will raise the accuracy a bit.
You may also find alternative weighting techniques such as log(1 + TF) * log(IDF) will improve accuracy. Good luck!
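If you are open to using scikit-learn for feature extraction, a weighting in that spirit can be obtained with TfidfVectorizer's sublinear_tf option, which replaces the raw term frequency with 1 + log(TF); this sketch also applies the top-1000-features and bigram suggestions above (train_texts is a hypothetical list of raw documents):

from sklearn.feature_extraction.text import TfidfVectorizer

# sublinear_tf=True uses 1 + log(tf) instead of the raw term frequency
vectorizer = TfidfVectorizer(sublinear_tf=True, max_features=1000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)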
There could be many reasons for the classifier not working, and there are many ways to tweak it.
did you train it with enough positive and negative examples?
how did you train the classifier? Did you give it every word as a feature, or did you also add more features for it to train on (like the length of the text, for example)?
what exactly are you trying to classify? Does the specified classification have specific words that are related to it?
So the question is rather broad. Maybe if you give more details you could get more relevant suggestions.
If you are using the nltk naive Bayes classifier, it's likely you're actually using smoothed multivariate Bernoulli naive Bayes text classification. This could be an issue if your feature extraction function maps into the set of all floating-point values (which it sounds like it might, since you're using TF-IDF) rather than the set of all boolean values.
If your feature extractor returns tf-idf values, then I think nltk.NaiveBayesClassifier will check if it is true that
tf-idf(word1_in_doc1) == tf-idf(word1_in_class1)
rather than asking the appropriate question for whatever continuous distribution suits tf-idf values.
This could explain your low accuracy, especially if one category occurs 53% of the time in your training set.
You might want to check out the multinomial naive bayes classifier implemented in scikit-learn.
For more information on multinomial and multivariate Bernoulli classifiers, see this very readable paper.
Like what Maus was saying, the NLTK Naive Bayes (NB) classifier uses a Bernoulli model plus smoothing to handle feature conditional probabilities of 0 (for features not seen by the classifier in training). A common smoothing technique is Laplace smoothing, where you add 1 to the numerator of the conditional probability, but I believe NLTK adds 0.5 to the numerator. The NLTK NB model uses boolean values and computes its conditionals based on that, so using TF-IDF as a feature will not produce good or even meaningful results.
If you want to stay within NLTK, then you should use the words themselves as features and bigrams. Check out this article by Jacob Perkins on text processing with NB in NLTK: http://streamhacker.com/tag/information-gain/. This article does a great job explaining and demonstrating some of the things you can do to pre-process your data; it uses the movie reviews corpus from NLTK for sentiment classification.
There is another Python module for text processing called scikit-learn, which has various NB models in it, like multinomial NB, which uses the frequency of each word rather than just its occurrence to compute its conditional probabilities.
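A minimal sketch of that scikit-learn route (train_texts, train_labels, and test_texts are hypothetical lists of raw documents and their labels):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Multinomial NB works on word counts rather than boolean or tf-idf features
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB(alpha=1.0))
model.fit(train_texts, train_labels)
predicted = model.predict(test_texts)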
Here is some literature on NB and how both the multinomial and Bernoulli models work:
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html; navigate through the literature using the previous/next buttons on the webpage.