Cleaned text has significantly worse classification accuracy?

I'm trying to classify whether or not I liked books that I've read this year based on the text in the books. I'm using the preprocessing described here, and a variety of sklearn classification models.
At first I was just feeding the models the raw text, but I cleaned it based on GloVe embeddings (a process described here). The text was improved from 40% vocab, 80% coverage to 80% vocab, 98% coverage based on GloVe embeddings. However, for some reason, after cleaning the text, the accuracy of the classification models seemed to be the same or lower.
Uncleaned text model results: [image not included]
Cleaned text model results: [image not included]
One thing to note is that the classes are quite imbalanced (75% of books were good as compared to 25% bad), so accuracy above 75% should be expected, since 75% is what the model would get if it guessed good every single time.
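To make that baseline concrete, here is a minimal sketch. The label list is made up to match the 75% / 25% split, and `majority_baseline_accuracy` is a helper name chosen only for illustration:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels with the same class balance as described above.
labels = ["good"] * 75 + ["bad"] * 25
baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")  # prints 0.75
```

Any model accuracy should be read against this number rather than against 50%.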
I've linked my full notebook here so you can check out the specific code if that will be helpful for solving this issue. I'm incredibly confused; I can't see where I'm going wrong, but it can't be right that cleaning the text data has zero or negative impact on model accuracy.

I think the main point you are missing is that data cleaning is an empirical process. Text preprocessing may consist of removing stop words, punctuation and numerals, and of lowercasing, but whether this improves the model's ability to learn and generalize can only be established through cross-validation, i.e. by training the model on the output of your preprocessing and checking how well it generalizes to held-out data.
In general, preprocessing (stop word removal, etc.) works well for bag-of-words models because it reduces dimensionality: BOW representations are long and sparse (see the curse of dimensionality for a possible theoretical foundation). The need for preprocessing is diminished with word embeddings such as word2vec or BERT.
In short, if you have any data preprocessing in mind, check if it helps your model to learn and generalize through properly constructed CV.
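As a rough illustration of that check, the sketch below scores two preprocessing variants on the same k-fold split. Everything in it is an assumption made for the example: the toy texts, the function names, and the tiny Naive Bayes model written with the standard library only so that it runs anywhere. In practice, sklearn's `cross_val_score` with a `Pipeline` would replace the hand-written loop.

```python
import math
import random
import re
from collections import Counter, defaultdict


def raw_tokens(text):
    return text.split()


def cleaned_tokens(text):
    # One possible "cleaning": lowercase and keep only alphabetic runs.
    return re.findall(r"[a-z]+", text.lower())


def train_nb(docs, labels):
    """Collect the counts needed for a tiny multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict_nb(model, tokens):
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            if w in vocab:  # words unseen in training are skipped
                # Add-one (Laplace) smoothing.
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_val_accuracy(texts, labels, tokenize, k=5, seed=0):
    """Mean accuracy over k folds for one preprocessing function."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        model = train_nb([tokenize(texts[i]) for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_nb(model, tokenize(texts[i])) == labels[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / k


# Toy data, invented for this example.
texts = [
    "A GREAT, moving story!", "Great characters; I loved it.",
    "Loved the ending, great pacing.", "A moving and great read.",
    "Wonderful story, loved it!", "Boring plot... dull.",
    "Dull and BORING!", "I hated the boring ending.",
    "Dull characters, hated it.", "A boring, dull story.",
]
labels = ["good"] * 5 + ["bad"] * 5

scores = {name: cross_val_accuracy(texts, labels, fn)
          for name, fn in [("raw", raw_tokens), ("cleaned", cleaned_tokens)]}
for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.2f}")
```

The point is only the comparison: both variants see the same folds and the same model, so any difference in the mean score comes from the preprocessing.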

Related

Does fine-tuning a BERT model multiple times with different datasets make it more accurate?

I'm totally new to NLP and BERT models.
What I'm trying to do right now is sentiment analysis on Twitter trending hashtags ("neg", "neu", "pos") using a DistilBERT model, but the accuracy was only about 50% (I tried with labelled data taken from Kaggle).
So here is my idea:
(1) First, I will fine-tune the DistilBERT model on the IMDB dataset (Model 1).
(2) Then, since I have some data taken from Twitter posts, I will run sentiment analysis on them with Model 1 and get Result 2.
(3) Then I will fine-tune Model 1 again on Result 2, expecting to get Model 3.
I'm not really sure whether this process would make the model more accurate.
Thanks for reading my post.

I'm a little skeptical about your first step. Since the IMDB dataset is different from your target data, I do not think it will positively affect the outcome of your work. I would therefore suggest fine-tuning on a dataset of tweets or other social media hashtags; however, if you are only focusing on hashtags and do not care about the text, that might work! My limited experience with fine-tuning transformers like BART and BERT shows that the dataset you fine-tune on should be very similar to your actual data. In general, though, you can fine-tune a model on several datasets, and if the datasets are structured for the same goal, this can improve the model's accuracy.

If you want to fine-tune a sentiment classification head of BERT for classifying tweets, then I'd recommend a different strategy:
The IMDB dataset captures a different kind of sentiment: the ratings do not really correspond to short-post sentiment, unless you want to focus on tweets about movies.
Using classifier output as input for further training of that classifier is not really a good approach, because if the classifier made many mistakes while classifying, these will be reflected in the training, and so the errors will deepen. This basically creates endogenous labels, which will not really improve your real-world classification.
You should consider other ways of obtaining labelled training data. There are a few good examples for twitter:
Twitter datasets on Kaggle - there are plenty of datasets available containing millions of various tweets. Some of those even contain sentiment labels (usually inferred from emoticons, as these were proven to be more accurate than words in predicting sentiment - for explanation see e.g. Frasincar 2013). So that's probably where you should look.
Stocktwits (if you're interested in financial sentiment): it contains posts that authors can label with a sentiment, so it is a perfect way of mining labelled data if stocks/cryptos are what you're looking for.
Another thing is picking a model that's better for your language, I'd recommend this one. It has been pretrained on 80M tweets, so should provide strong improvements. I believe it even contains a sentiment classification head that you can use.
Roberta Twitter Base
Check out the website for that and guidance for loading the model in your code - it's very easy, just use the following code (this is for sentiment classification):
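from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hugging Face model ID of the tweet-pretrained sentiment model
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)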
Another benefit of this model is that it has been pretrained from scratch with a vocabulary that contains emojis, meaning it has a deep understanding of them, their typical contexts and co-occurrences. This can greatly benefit social media classification, as many researchers would agree that emojis/emoticons are better predictors of sentiment than normal words.

What should be the ideal validation accuracy of an LSTM-based text generator?

I modelled an LSTM-based text generator using a data set I have. The purpose of the model is to predict the end of sentences. My training is showing a validation accuracy of around 81%. When reading through a couple of articles, I found that, unlike in a classification problem, I should worry more about loss than accuracy. Is this the case, and if so, what would be an ideal loss value? Right now my loss is around 1.5+.

There is no minimum threshold for accuracy in any machine learning or deep learning problem. As many say: garbage in, garbage out.
Good-quality data with a decent model will give you good accuracy.
Generally, accuracy benchmarks are set on standard, publicly available datasets such as SQuAD, RACE, SWAG, GLUE and many more.
State-of-the-art models usually report their performance on these datasets, which sets an accuracy benchmark specific to each dataset.
Coming to your problem: you can judge whether the model is performing well from the accuracy and from the evaluation metric you are using, but in NLP the loss is generally a bit tricky to interpret. In your case, where you are trying to predict the end of a sentence, the output has no fixed length, because the same information can be expressed in multiple ways with a varying number of words.
Judging by the validation and test accuracy, your model looks decent, but before pushing the accuracy further you should also worry about overfitting: the model should not be biased towards your data.
You can try with different metrics to evaluate the model and you can compare the results on your own.
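One metric commonly reported for language models is perplexity, the exponential of the mean per-token cross-entropy loss. A minimal sketch, assuming the loss of 1.5 from the question is a mean categorical cross-entropy measured with the natural log (an assumption, since the question does not say how the loss is computed); `perplexity` is a helper name chosen for illustration:

```python
import math


def perplexity(cross_entropy_loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


loss = 1.5  # validation loss reported in the question
print(f"perplexity = {perplexity(loss):.2f}")  # prints "perplexity = 4.48"
```

Under that assumption, a loss of 1.5 means the model is on average about as uncertain as if it were choosing uniformly among roughly 4.5 candidate tokens, which is easier to reason about than the raw loss value.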
I hope this answers your question, Happy Learning!

How to use doc2vec model in production?

I wonder how to deploy a doc2vec model in production to create word vectors as input features to a classifier. To be specific, let's say a doc2vec model is trained on a corpus as follows.
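import multiprocessing
from gensim.models import doc2vec

cores = multiprocessing.cpu_count()  # assumed; not defined in the original snippet
# Tag each row's token list with its ID so that every document gets its own vector
dataset['tagged_descriptions'] = dataset.apply(lambda x: doc2vec.TaggedDocument(
    words=x['text_columns'], tags=[str(x.ID)]), axis=1)
model = doc2vec.Doc2Vec(vector_size=100, min_count=1, epochs=150, workers=cores,
                        window=5, hs=0, negative=5, sample=1e-5, dm_concat=1)
corpus = dataset['tagged_descriptions'].tolist()
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)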
and then it is dumped into a pickle file. The resulting vectors are used to train a classifier, such as a random forest, to predict movie sentiment.
Now suppose that in production there is a document containing some entirely new vocabulary, that is, words that were not present during the training of the doc2vec model. I wonder how to tackle such a case.
As a side note, I am aware of Updating training documents for gensim Doc2Vec model and Gensim: how to retrain doc2vec model using previous word2vec model. However, I would appreciate more light being shed on this matter.

A Doc2Vec model will only be able to report trained-up vectors for documents that were present during training, and only be able to infer_vector() new doc-vectors for texts containing words that were present during training. (Unrecognized words passed to .infer_vector() will be ignored, similar to the way any words appearing fewer than min_count times are ignored during training.)
If over time you acquire many new texts with new vocabulary words, and those words are important, you'll have to occasionally re-train the Doc2Vec model. And, after re-training, the doc-vectors from the re-trained model will generally not be comparable to doc-vectors from the original model – so downstream classifiers and other applications using the doc-vectors will need updating, as well.
Your own production/deployment requirements will drive how often this re-training should happen, and old models replaced with newer ones.
(While a Doc2Vec model can be fed new training data at any time, doing so incrementally as a sort of 'fine-tuning' introduces hairy issues of balance between old and new data. And, there's no official gensim support for expanding the existing vocabulary of a Doc2Vec model. So, the most robust course is to retrain from scratch using all available data.)
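One way to judge when retraining is due is to track how much of the incoming text falls outside the training vocabulary. The sketch below is only an illustration: `oov_rate`, the toy vocabulary and the 0.2 threshold are all assumptions, not gensim features. With gensim 4.x the known vocabulary could be taken from `model.wv.key_to_index`.

```python
def oov_rate(documents, known_vocab):
    """Fraction of tokens in `documents` that are missing from `known_vocab`."""
    total = 0
    unknown = 0
    for tokens in documents:
        for token in tokens:
            total += 1
            if token not in known_vocab:
                unknown += 1
    return unknown / total if total else 0.0


# Hypothetical training vocabulary and newly arrived, already tokenized documents.
known_vocab = {"the", "movie", "was", "great", "boring"}
new_docs = [["the", "series", "was", "great"], ["binge", "worthy", "movie"]]

RETRAIN_THRESHOLD = 0.2  # arbitrary value chosen for this example
rate = oov_rate(new_docs, known_vocab)
print(f"OOV rate: {rate:.2f}, retrain: {rate > RETRAIN_THRESHOLD}")
```

Since unrecognized words are silently ignored at inference time, a rising rate like this is a signal that inferred vectors are being built from less and less of each document.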
A few side notes on your example training code:
it's rare for min_count=1 to be a good idea: rare words often serve as 'noise', lacking sufficient usage examples to be modelled well, and that noise just slows down or interferes with the patterns that can be learned from more common words
dm_concat=1 is best considered an experimental/advanced mode, as it makes models significantly larger and slower to train, with unproven benefits.
much published work uses just 10-20 training epochs; smaller datasets or smaller docs will sometimes benefit from more, but 150 may be taking a lot of time with very little marginal benefit.

Text classification with Naive Bayes

I am learning NLP and noticed that TextBlob classification based on Naive Bayes (TextBlob is built on top of NLTK) https://textblob.readthedocs.io/en/dev/classifiers.html works fine when the training data is a list of sentences and does not work at all when the training data consists of individual words (each word with an assigned class).
Why?

Because you don't have single words in the training data.
Usually the training and evaluation/testing data are supposed to be selected with identical distribution. Biases or skews are usually problematic. In very few cases you can train the model to do one thing and use it to do something else.
In your case, the model likely spreads the weights over the words in the sentence. So when you pick a single word, you only get a small portion of the weight represented.
To get it to work you should add single word examples to your training data.

Text Preprocessing for classification - Machine Learning

What are the important steps for preprocessing Twitter texts for binary classification? What I did is remove the hashtag symbol while keeping the tag text, and I also used a regular expression to remove special characters. These are the two functions I used.
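import re
def removeusername(tweet):
    # Split on '#' and '_' and rejoin the pieces with single spaces
    return " ".join(word.strip() for word in re.split('#|_', tweet))

def removingSpecialchar(text):
    # Drop hashtags, non-alphanumeric characters and URLs, then collapse whitespace
    return ' '.join(re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())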
What other things can be done to preprocess text data? I have also used the NLTK stopword corpus to remove all stop words from the tokenized words.
I used the NaiveBayes classifier in TextBlob to train on the data, and I am getting 94% accuracy on the training data and 82% on the test data. I want to know whether there is any other method to get better accuracy. By the way, I am new to machine learning and have only a limited idea of all of it!

Well, you can start by playing with the size of your vocabulary. You might exclude some of the words that are too frequent in your data (without being considered stop words), and do the same with words that appear in only one tweet (misspelled words, for example). Sklearn's CountVectorizer lets you do this easily; have a look at its min_df and max_df parameters.
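To show what that filtering amounts to, here is a standard-library sketch of the same idea. The function name and the toy tweets are made up; the thresholds follow the CountVectorizer convention in which an integer `min_df` is a document count and a float `max_df` is a proportion of documents:

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=2, max_df=0.8):
    """Keep terms found in at least `min_df` docs and in at most a `max_df` fraction of docs."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term for term, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df}


# Toy tokenized tweets, invented for this example ("terible" is a deliberate misspelling).
tweets = [
    ["rt", "great", "match", "today"],
    ["rt", "great", "goal"],
    ["rt", "terrible", "match"],
    ["rt", "terible", "referee"],
]
vocab = filter_by_document_frequency(tweets, min_df=2, max_df=0.8)
print(sorted(vocab))  # ['great', 'match']
```

Here "rt" is dropped for appearing in every tweet, and the one-off words, including the misspelling, are dropped for appearing in only one.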
Since you are working with tweets, you can also think about URL strings. Try to obtain some valuable information from links; there are lots of options, from simple approaches based on regular expressions that retrieve the domain name of the page to more complex NLP-based methods that study the link content. Once more, it's up to you!
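The simple end of that range could look like the sketch below, which pulls the domain out of each link so it can be used as a feature. The pattern, the function name and the example tweet are assumptions for illustration; real tweets often use shortened links, which would need to be resolved first.

```python
import re
from urllib.parse import urlparse

# Deliberately simple pattern: anything starting with http:// or https:// up to the next space.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_domains(tweet):
    """Return the host names of all http(s) links found in a tweet."""
    return [urlparse(url).netloc.lower() for url in URL_PATTERN.findall(tweet)]


tweet = "Loved this review https://www.example.com/books/42 and https://blog.example.org/post"
print(extract_domains(tweet))  # ['www.example.com', 'blog.example.org']
```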
I would also have a look at pronouns (if you are using spaCy), since its lemmatizer (in versions before 3.0) by default replaces all of them with the keyword -PRON- . This is a classic solution that simplifies things but might result in a loss of information.

For preprocessing raw data, you can try:
Stop word removal.
Stemming or Lemmatization.
Exclude terms that are either too common or too rare.
A second preprocessing step is then possible:
Construct a TF-IDF matrix.
Construct or load pretrained word embeddings (Word2Vec, fastText, ...).
Then you can feed the result of the second step into your model.
These are just the most common methods; many others exist.
I will let you check each one of these methods by yourself, but it is a good base.

There are no compulsory steps. For example, it is very common to remove stop words (also called function words) such as "yes", "no", "with". But in one of my pipelines I skipped this step and the accuracy did not change. NLP is an experimental field, so the most important advice is to build a pipeline that runs as quickly as possible, to define your goal, and to train with different parameters.
Before you move on, you need to make sure your training set is sound. What are you training for? Is your set clean (e.g. does the positive class contain only positives)? How do you define accuracy, and why?
Now, the situation you described seems like a case of overfitting. Why? Because you get 94% accuracy on the training set, but only 82% on the test set.
This problem happens when you have a lot of features but a relatively small training dataset, so the model fits the specific training set well but fails to generalize.
Now, you did not specify how large your dataset is, so I'm guessing between 50 and 500 tweets, which is too small given an English vocabulary of some 200k words or more. I would try one of the following options:
(1) Get more training data (at least 2000)
(2) Reduce the number of features; for example, you can remove uncommon words and names, i.e. any words that appear only a small number of times
(3) Use a better classifier (Naive Bayes is rather weak for NLP). Try an SVM or deep learning.
(4) Try regularization techniques
