How to improve classification of small texts - python

The data I've got are mostly tweets or small comments (300-400 chars). I used a Bag-of-Words model with Naive Bayes classification. Now I'm getting a lot of misclassified cases of the type shown below:
1.] He sucked on a lemon early morning to get rid of hangover.
2.] That movie sucked big time.
Now the problem is that during sentiment classification both get labeled "Negative" just because of the word "sucked".
Sentiment Classification : 1.] Negative 2.] Negative
Similarly, during document classification both get classified as "movies" due to the presence of the word "sucked".
Document classification : 1.] Movie 2.] Movie
This is just one such instance; I'm facing a huge number of wrong classifications and don't have any idea how to improve the accuracy.

(1)
One straightforward change from Bag-of-Words with Naive Bayes is to generate polynomial combinations of the Bag-of-Words features. It might solve the problems you have shown above:
"sucked" + "lemon" (positive)
"sucked" + "movie" (negative)
Of course, you can also generate polynomial combinations of n-grams, but the number of features might become too large.
The scikit-learn library provides a preprocessing class for this purpose:
sklearn.preprocessing.PolynomialFeatures (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
Theoretically, an SVM with a polynomial kernel does the same thing as PolynomialFeatures + linear SVM; they differ only in how the model information is stored.
In my experience, PolynomialFeatures + linear SVM performs reasonably well for short-text classification, including sentiment analysis.
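A minimal sketch of this idea in scikit-learn (the toy texts and labels are made up for illustration; note that degree-2 combinations grow quadratically, so keep the vocabulary small, and sparse input to PolynomialFeatures needs a reasonably recent scikit-learn):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import LinearSVC

    texts = [
        "he sucked on a lemon early morning to get rid of hangover",
        "that movie sucked big time",
        "the lemon tart was wonderful",
        "the movie was boring",
    ]
    labels = [1, 0, 1, 0]   # toy sentiment labels: 1 = positive, 0 = negative

    model = make_pipeline(
        CountVectorizer(max_features=1000),   # cap the vocabulary size
        # interaction_only=True yields pairwise products such as
        # "sucked" x "lemon" and "sucked" x "movie"
        PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
        LinearSVC(),
    )
    model.fit(texts, labels)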
If the dataset is not large enough, the training set might not contain the pair "sucked" + "lemon". In that case, dimensionality reduction such as Singular Value Decomposition (SVD) or topic models such as Latent Dirichlet Allocation (LDA) are suitable tools for building semantic clusters of words.
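For example, a minimal sketch of SVD over tf-idf features with scikit-learn's TruncatedSVD (the toy corpus and the number of components are placeholders; use a few hundred components on real data):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["he sucked on a lemon",
            "that movie sucked big time",
            "the lemon drink was sour"]          # toy corpus
    tfidf = TfidfVectorizer().fit_transform(docs)

    # Project the sparse tf-idf vectors onto dense "semantic" dimensions
    svd = TruncatedSVD(n_components=2)
    dense = svd.fit_transform(tfidf)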
(2)
Another direction is to utilize more sophisticated natural language processing (NLP) techniques to extract additional information from short texts. For example, Part-of-Speech (POS) tagging and Named Entity Recognition (NER) will give more information than plain BoW features. The Natural Language Toolkit (NLTK), a Python library for NLP, implements these functions.
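For instance, a sketch with NLTK (the tagger and chunker models need one-time downloads via nltk.download):

    import nltk
    # One-time downloads: nltk.download('punkt'),
    # nltk.download('averaged_perceptron_tagger'),
    # nltk.download('maxent_ne_chunker'), nltk.download('words')

    tokens = nltk.word_tokenize("That movie sucked big time.")
    tagged = nltk.pos_tag(tokens)    # POS tags, e.g. ('movie', 'NN')
    print(tagged)
    print(nltk.ne_chunk(tagged))     # named-entity chunks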
(3)
You can also take the slow but steady way: analyzing the errors in the current model's predictions in order to design new hand-crafted features is a promising way to improve the accuracy of the model.
There is a library for short-text classification called LibShortText, which also contains an error-analysis function and preprocessing functions such as TF-IDF weighting. It might help you learn how to improve the model via error analysis.
LibShortText (https://www.csie.ntu.edu.tw/~cjlin/libshorttext/)
(4)
For further information, the literature on sentiment analysis of tweets will give you more advanced techniques.

Maybe you could try a more powerful classifier like a Support Vector Machine. Also, depending on the amount of data you have, you could try deep learning with convolutional neural networks. For this you will need a huge number of training examples (100k-1M).

Related

it is normal that CNN give me better accuracy compared to LSTM in text classification?

For text classification, I have a dataset of 1000 reviews and I tried different neural networks. With the CNN I got an accuracy of 0.94, but with the LSTM I got a lower accuracy (0.88). Is this normal? As far as I know, the LSTM is specialized for text classification and preserves the order of the word sequence.
Yes, this isn't abnormal; similar results have been shown in a lot of research.
The performance of these models depends on many factors, like the data you have and the task you are dealing with.
For example, a CNN can perform well if your task cares mostly about detecting a few substantial features (like the sentiment).
However, RNN-based models can show their superiority when the sequential aspect of the data matters, as in machine translation and text summarization tasks.
I don't believe that "LSTM is specialized for text classification" is true; it's better to say that LSTM is specialized for learning sequential data. An LSTM can model texts and the relations between tokens very well, but the task you defined may not care about those linguistic features. For example, in sentiment classification, a model (like a CNN) can look at just the presence of some words and achieve good results. A rough sketch of the two architectures follows below.
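As a rough illustration (not from the original answer), this is what the two architectures might look like in TF 2 / Keras; the vocabulary size, embedding dimension, and layer widths are placeholder values:

    import tensorflow as tf

    VOCAB, DIM = 10000, 64   # placeholder hyperparameters

    # CNN: pools over positions, so it mainly detects the presence of
    # informative local n-gram features
    cnn = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB, DIM),
        tf.keras.layers.Conv1D(64, 5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

    # LSTM: reads the tokens in order, so it models the full sequence
    lstm = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB, DIM),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])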

Advice for my plan - large dataset of students and grades, looking to classify bottom 2%

I have a dataset which includes socioeconomic indicators for students nationwide as well as their grades. More specifically, this dataset has 36 predictor variables for about 30 million students, with the students' grades as the responses.
My goal is to predict whether a student will fail out (i.e., be in the bottom 2nd percentile of the nation in terms of grades). I understand that classification with an imbalanced dataset (98% : 2%) will introduce a bias. Based on some research, I planned to account for this by increasing the cost of an incorrect classification in the minority class.
Can someone please confirm that this is the correct approach (and that there isn't a better one, I'm assuming there is)? And also, given the nature of this dataset, could someone please help me choose a machine learning algorithm to accomplish this?
I am working with TensorFlow 2.0 in a Google Colab. I've compiled all the data together into a .feather file using pandas.
For an imbalanced dataset, class weighting is the most common approach. But with such a large dataset (30M training examples) and a binary problem where the first class represents 2% and the second 98%, I would say it is hard to keep the model unbiased against the first class using class weights alone, since the effect is not much different from shrinking the training set until it is balanced.
Here are some steps for evaluating model accuracy.
Split your dataset into train, evaluation, and test sets.
For the evaluation metric, I suggest these alternatives:
a. Make sure at least 20% of both the evaluation and test sets represents the first (minority) class.
b. Use precision and recall as the evaluation metrics for your model (rather than the F1 score).
c. Use Cohen's kappa score (coefficient) as the evaluation metric.
From my own perspective, I prefer b; a sketch of b and c follows below.
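A minimal sketch of options b and c with scikit-learn metrics (the label and prediction arrays are placeholders for your evaluation set):

    from sklearn.metrics import precision_score, recall_score, cohen_kappa_score

    y_true = [0, 0, 0, 1, 1, 0, 1, 0]   # placeholder evaluation-set labels
    y_pred = [0, 0, 1, 1, 0, 0, 1, 0]   # placeholder model predictions

    print("precision:", precision_score(y_true, y_pred))   # option b
    print("recall:", recall_score(y_true, y_pred))         # option b
    print("kappa:", cohen_kappa_score(y_true, y_pred))     # option c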
Since you are using TensorFlow, I assume you are familiar with deep learning, so use deep learning rather than classical machine learning; it gives you many additional alternatives. In any case, here are some options for both approaches.
For Machine Learning Algorithms
1. Decision tree algorithms (especially Random Forest).
2. If the features have almost no correlation (close to zero, e.g. 0.01), I would try Complement Naive Bayes for multinomial features, or Gaussian Naive Bayes with class weights for continuous features.
3. Some nonparametric learning algorithms. You may not be able to fit this training set with Support Vector Machines (SVMs) easily because the dataset is rather large, but you could try.
4. Unsupervised learning algorithms (these sometimes give you a more generic model).
For Deep Learning Algorithms
1. Encoder-decoder architectures, or simply generative adversarial networks (GANs).
2. Siamese networks.
3. Models trained with 1D convolution layers.
4. Class weights.
5. Balanced batches of the training set, randomly chosen.
You have many other alternatives; from my own perspective, I would try hard with deep-learning options 1, 3, or 5. The 5th approach (balanced batches) sometimes works very well, and I recommend trying it together with 1 and 3. Sketches of options 4 and 5 follow below.
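A minimal sketch of class weighting (deep-learning option 4, and what the question itself proposed) and balanced batches (option 5) in TensorFlow 2 / Keras; the data and the tiny architecture are placeholders:

    import numpy as np
    import tensorflow as tf

    # Placeholder data: 36 socioeconomic features, ~2% positive labels
    X = np.random.rand(10000, 36).astype("float32")
    y = (np.random.rand(10000) < 0.02).astype("int32")

    # Option 4: weight each class inversely to its frequency
    neg, pos = np.bincount(y)
    class_weight = {0: len(y) / (2.0 * neg), 1: len(y) / (2.0 * pos)}

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(36,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall()])
    model.fit(X, y, epochs=3, class_weight=class_weight)

    # Option 5: draw balanced batches by sampling each class equally
    pos_idx, neg_idx = np.where(y == 1)[0], np.where(y == 0)[0]

    def balanced_batch(batch_size=64):
        half = batch_size // 2
        idx = np.concatenate([np.random.choice(pos_idx, half),
                              np.random.choice(neg_idx, half)])
        return X[idx], y[idx]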

nlp multilabel classification tf vs tfidf

I am trying to solve an NLP multilabel classification problem. I have a huge amount of documents that should be classified into 29 categories.
My approach to the problem, after cleaning up the text (stop-word removal, tokenizing, etc.), was the following:
To create the feature matrix I looked at the frequency distribution of the terms in each document. I then created a table of these terms (with duplicates removed) and calculated the term frequency of each word in its corresponding text (tf). Eventually I ended up with around 1000 terms and their respective frequencies in each document.
I then used SelectKBest to narrow them down to around 490, and after scaling them I used OneVsRestClassifier(SVC) to do the classification.
I am getting an F1 score of around 0.58, but it is not improving at all and I need to get to 0.62.
Am I handling the problem correctly?
Do I need to use tfidf vectorizer instead of tf, and how?
I am very new to NLP and I am not sure at all what to do next and how to improve the score.
Any help in this subject is priceless.
Thanks
The tf method can give common words more importance than necessary; use the tf-idf method instead, which gives importance to words that are rare and distinctive within a particular document in the dataset.
Also, rather than selecting the K best features up front, train on the whole feature set first and then use feature importances to pick the best features.
You can also try tree classifiers or XGBoost models, although SVC is a very good classifier too.
Try using Naive Bayes as the baseline F1 score, and improve your results on other classifiers with the help of grid search. A sketch of the tf-idf variant follows below.
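A minimal sketch of swapping tf-idf in for raw term frequencies (the toy documents and 2-label indicator matrix stand in for your corpus and 29 categories; your SelectKBest and scaling steps can stay in between):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    documents = [
        "markets rally on strong bank earnings",
        "the team won the final game",
        "bank stocks slump before the big game",
    ]
    Y = np.array([[1, 0], [0, 1], [1, 1]])   # toy labels; yours has 29 columns

    vectorizer = TfidfVectorizer(sublinear_tf=True)   # tf-idf instead of raw tf
    X = vectorizer.fit_transform(documents)

    clf = OneVsRestClassifier(SVC()).fit(X, Y)
    print(clf.predict(vectorizer.transform(["bank earnings rise"])))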

scikit-learn: classifying texts using custom labels

I have a large training set of words labeled pos and neg to classify texts. I used TextBlob (according to this tutorial) to classify texts. While it works fairly well, it can be very slow for a large training set (e.g. 8k words).
I would like to try doing this with scikit-learn but I'm not sure where to start. What would the above tutorial look like in scikit-learn? I'd also like the training set to include weights for certain words: some should pretty much guarantee that a particular text is classed as "positive", while others should guarantee that it's classed as "negative". And lastly, is there a way to imply that certain parts of the analyzed text are more valuable than others?
Any pointers to existing tutorials or docs appreciated!
There is an excellent chapter on this topic in Sebastian Raschka's Python Machine Learning book and the code can be found here: https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb.
He does sentiment analysis (what you are trying to do) on an IMDB dataset. His data is not as clean as yours, from the looks of it, so he needs to do a bit more pre-processing work. Your problem can be solved with these steps (put together in the sketch after this list):
Create numerical features by vectorizing your text: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
Train test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Train and test your favourite model, e.g.: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
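Put together, those three steps might look like this (the toy texts and labels are placeholders):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    texts = ["great film", "terrible plot", "loved it", "awful acting"] * 10
    labels = ["pos", "neg", "pos", "neg"] * 10   # placeholder data

    # 1. Vectorize the text into numerical features
    X = HashingVectorizer(n_features=2**12).fit_transform(texts)

    # 2. Train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.25, random_state=0)

    # 3. Train and test a model
    clf = LogisticRegression().fit(X_train, y_train)
    print(clf.score(X_test, y_test))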
There are many ways to do this, like tf-idf (Term Frequency - Inverse Document Frequency), Count Vectorizer, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Word2Vec.
Among all of the above-mentioned methods, Word2Vec is the best. You can use a pre-trained Word2Vec model from Google, available at:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
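Loading those pretrained vectors with gensim might look like this (the file name assumes you downloaded the archive from the repository above):

    from gensim.models import KeyedVectors

    # File from the GoogleNews repository linked above
    w2v = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True)
    print(w2v.most_similar("movie"))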

document classification using naive bayes in python

I'm doing a project on document classification using a naive Bayes classifier in Python. I have used the nltk module for this. The docs are from the Reuters dataset. I performed preprocessing steps such as stemming and stopword elimination and proceeded to compute tf-idf for the index terms. I used these values to train the classifier but the accuracy is very poor (53%). What should I do to improve the accuracy?
A few points that might help:
Don't use a stoplist; it lowers accuracy (but do remove punctuation).
Look at word features and take only the top 1000, for example. Reducing dimensionality will improve your accuracy a lot.
Use bigrams as well as unigrams; this will raise the accuracy a bit.
You may also find that alternative weighting techniques such as log(1 + TF) * log(IDF) improve accuracy (one reading of this is sketched below). Good luck!
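A hand-rolled sketch of that weighting, reading it as a log-scaled term frequency times a log inverse document frequency (the function name and toy numbers are mine); scikit-learn's TfidfVectorizer(sublinear_tf=True) applies the related 1 + log(TF) scaling out of the box:

    import math

    def log_tf_log_idf(tf, df, n_docs):
        # log-scaled term frequency times log inverse document frequency,
        # per the weighting suggestion above
        return math.log(1 + tf) * math.log(n_docs / df)

    print(log_tf_log_idf(tf=3, df=10, n_docs=1000))   # toy numbers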
There could be many reasons for the classifier not working, and there are many ways to tweak it.
did you train it with enough positive and negative examples?
how did you train the classifier? Did you give it every word as a feature, or did you also add more features for it to train on (like the length of the text, for example)?
what exactly are you trying to classify? Does the specified classification have specific words that are related to it?
So the question is rather broad. Maybe if you give more details you could get more relevant suggestions.
If you are using the nltk naive Bayes classifier, it's likely you're actually using smoothed multivariate Bernoulli naive Bayes text classification. This could be an issue if your feature extraction function maps into the set of all floating-point values (which it sounds like it might, since you're using tf-idf) rather than the set of all boolean values.
If your feature extractor returns tf-idf values, then I think nltk.NaiveBayesClassifier will check whether
tf-idf(word1_in_doc1) == tf-idf(word1_in_class1)
rather than asking the appropriate question for whatever continuous distribution is appropriate to tf-idf.
This could explain your low accuracy, especially if one category occurs 53% of the time in your training set.
You might want to check out the multinomial naive bayes classifier implemented in scikit-learn.
For more information on multinomial and multivariate Bernoulli classifiers, see this very readable paper.
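A minimal sketch of that scikit-learn classifier on word counts (toy Reuters-like documents and topics; the point is that MultinomialNB consumes frequencies rather than booleans):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["oil prices rise sharply", "wheat harvest falls",
             "crude oil exports grow"]
    topics = ["energy", "grain", "energy"]   # toy Reuters-style topics

    nb = make_pipeline(CountVectorizer(), MultinomialNB())
    nb.fit(texts, topics)
    print(nb.predict(["oil output rises"]))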
Like what Maus was saying, NLTK Naive Bayes (NB) uses a Bernoulli model plus smoothing to handle feature conditional probabilities of 0 (for features not seen by the classifier in training). A common smoothing technique is Laplace smoothing, where you add 1 to the numerator of the conditional probability, but I believe NLTK adds 0.5 to the numerator. The NLTK NB model uses boolean values and computes its conditionals based on that, so using tf-idf as a feature will not produce good or even meaningful results.
If you want to stay within NLTK, then you should use the words themselves as features, along with bigrams. Check out this article by Jacob Perkins on text processing with NB in NLTK: http://streamhacker.com/tag/information-gain/. This article does a great job explaining and demonstrating some of the things you can do to pre-process your data; it uses the movie reviews corpus from NLTK for sentiment classification.
There is another Python module for text processing called scikit-learn, which has various NB models in it, like Multinomial NB, which uses the frequency of each word (instead of the mere occurrence of each word) to compute its conditional probabilities.
Here is some literature on NB and the how both the Multinomial and Bernoulli models work:
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html; navigate through the literature using the previous/next buttons on the webpage.
