I created a corpus of 30,000 headlines. I want to predict the sentiments of these headlines using advanced supervised machine learning (deep learning) methods such as RNN, LSTM, or DNN.
My question is: is it possible to train and test a deep learning model on any labeled dataset, such as IMDB movie reviews, Amazon reviews, or Yelp reviews?
For example, suppose we train and test an RNN on the IMDB movie reviews dataset and it gives us a 92% F1 score.
Then, can I input my unlabeled dataset (30,000 headlines) and predict their sentiments with this trained and tested model?
The reason for asking this question is that I found many blogs and tutorials with code that use deep learning methods for sentiment analysis. They take a labeled dataset, train and test the model, and show the accuracy or F1 score. Nobody goes further and inputs unlabeled data to "predict" the sentiment with their model. That is why I am wondering whether it is possible or not.
Thanks for your suggestions and advice.
Good question,
Yes, nothing stops you from testing it against your own dataset. However, this is not how it is supposed to be done:
Consider, for example, that you train a model on Amazon reviews and then test it on movie reviews. So what's different? The distributions of the data are different, and this may have a lot of side effects. The choice of words, sentences, and metaphors would differ between the two sets of reviews.
For example, consider this review in the life sciences domain:
"This drug partially cures cancer"
This is most likely to be classified as negative if you had trained on Amazon review data, because "cancer" is a negative word in other domains. So there is a need to train different sentiment classifiers for different domains.
Summary:
Try to use data from the same data source wherever possible.
Train and predict on data from the same domain.
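A quick, rough check of how far a target corpus sits from the training corpus is the share of target tokens that never appear in the training data. The sketch below is only an illustration: the toy sentences, the whitespace tokenizer, and the function name `oov_rate` are assumptions, not something taken from the answer above.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def oov_rate(train_texts, target_texts):
    """Fraction of target tokens that never occur in the training texts."""
    vocab = {tok for text in train_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)


# Toy stand-ins for a product-review training set and a headline corpus.
reviews = ["great phone and great battery", "the battery died fast"]
headlines = ["central bank raises rates", "phone maker shares fall fast"]

rate = oov_rate(reviews, headlines)
print(f"out-of-vocabulary rate: {rate:.2f}")
```

A high rate does not prove the model will fail, but it is a cheap warning sign that the training and target domains use different language.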
Related
I'm totally new to NLP and the BERT model.
What I'm trying to do right now is sentiment analysis on Twitter trending hashtags ("neg", "neu", "pos") using the DistilBERT model, but the accuracy was only about 50% (I tried with labeled data taken from Kaggle).
So here is my idea:
(1) First, I will fine-tune the DistilBERT model (Model 1) on the IMDB dataset.
(2) After that, since I have some data taken from Twitter posts, I will run sentiment analysis on them with Model 1 and get Result 2.
(3) Then I will fine-tune Model 1 again on Result 2, expecting to get Model 3.
I'm not really sure whether this process does anything to make the model more accurate.
Thanks for reading my post.
I'm a little skeptical about your first step. Since the IMDB dataset is different from your target data, I do not think it will positively affect the outcome of your work. Thus, I would suggest fine-tuning on a dataset of tweets or other social media hashtags; however, if you are only focusing on hashtags and do not care about the text, that might work! My limited experience with fine-tuning transformers like BART and BERT shows that the dataset you fine-tune on should be very similar to your actual data. But in general, you can fine-tune a model on different datasets, and if the datasets are structured for the same goal, it can improve the model's accuracy.
If you want to fine-tune a sentiment classification head of BERT for classifying tweets, then I'd recommend a different strategy:
The IMDB dataset captures a different kind of sentiment: the ratings do not really correspond to the sentiment of short posts, unless you want to focus on tweets about movies.
Using classifier output as input for further training of that same classifier is not really a good approach, because any mistakes the classifier made will be reflected in the training data, and so the errors will deepen. This is basically creating endogenous labels, which will not really improve your real-world classification.
You should consider other ways of obtaining labelled training data. There are a few good examples for twitter:
Twitter datasets on Kaggle - there are plenty of datasets available containing millions of various tweets. Some of those even contain sentiment labels (usually inferred from emoticons, as these were proven to be more accurate than words in predicting sentiment - for explanation see e.g. Frasincar 2013). So that's probably where you should look.
Stocktwits (if you're interested in financial sentiment) - contains posts that authors can label with a sentiment themselves, so it is a perfect source of labelled data if stocks/cryptos are what you're looking for.
Another thing is picking a model that's better suited to your kind of text. I'd recommend this one. It has been pretrained on 80M tweets, so it should provide strong improvements. I believe it even contains a sentiment classification head that you can use.
Roberta Twitter Base
Check out the website for that and guidance for loading the model in your code - it's very easy, just use the following code (this is for sentiment classification):
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
Another benefit of this model is that it has been pretrained from scratch with a vocabulary that contains emojis, meaning it has learned their typical contexts and co-occurrences. This can greatly benefit social media classification, as many researchers would agree that emojis/emoticons are better predictors of sentiment than normal words.
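Predicting labels for unlabeled tweets with this checkpoint could look roughly like the sketch below. It assumes the `transformers` and `torch` packages are installed and that the label order is negative, neutral, positive, as listed on the model card; the helper names `softmax`, `decode`, and `predict_sentiment` are made up for illustration.

```python
import math

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
# Label order as documented on the model card (assumed unchanged).
LABELS = ["negative", "neutral", "positive"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


def predict_sentiment(texts):
    """Load the model and classify a list of strings."""
    # Imported here so the helpers above work without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return [decode(row.tolist()) for row in logits]
```

Calling `predict_sentiment` with a list of tweets returns one `(label, probability)` pair per tweet; the first call downloads the model weights.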
I am currently working on a system that extracts certain features from 3D objects (voxel grids, to be precise), and I would like to compare those hand-made features with automatically learned features in terms of classification performance in a TensorFlow CNN on some other data, but that is not the point here, just background.
My idea now was to take a dataset (ModelNet10), train a TensorFlow CNN to classify it, and then use what it learned there on my dataset - not to classify, but to extract features.
So I want to throw away everything the CNN does, except for what it extracts from the objects.
Is there any way to get these features, and how do I do that? I really have no idea.
Yes, it is possible to train models exclusively for feature extraction. This is called transfer learning where you can either train your own model and then extract the features or you can extract features from pre-trained models and then use it in your task if your task is similar in nature to that of what the pre-trained model was trained for. You can of course find a lot of material online for these topics. However, I am providing some links below which give details on how you can go about it:
https://keras.io/api/applications/
https://keras.io/guides/transfer_learning/
https://machinelearningmastery.com/how-to-use-transfer-learning-when-developing-convolutional-neural-network-models/
https://www.pyimagesearch.com/2019/05/27/keras-feature-extraction-on-large-datasets-with-deep-learning/
https://www.kaggle.com/angqx95/feature-extractor-fine-tuning-with-keras
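As a rough sketch of the idea in Keras: train a classifier, then build a second model that shares its layers but stops at an intermediate layer. The architecture, the 16x16x16 grid size, and the layer name `features` are assumptions chosen for illustration, and the training step is left out.

```python
import numpy as np
from tensorflow import keras


def build_classifier(num_classes=10, grid=16):
    """Small 3D CNN for voxel grids (hypothetical architecture)."""
    inputs = keras.Input(shape=(grid, grid, grid, 1))
    x = keras.layers.Conv3D(8, 3, activation="relu")(inputs)
    x = keras.layers.MaxPooling3D(2)(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(32, activation="relu", name="features")(x)
    outputs = keras.layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)


classifier = build_classifier()
# Training on ModelNet10 would happen here, e.g. with classifier.compile(...)
# followed by classifier.fit(...).

# Reuse the trained layers, but stop at the layer before the class scores.
extractor = keras.Model(
    inputs=classifier.input,
    outputs=classifier.get_layer("features").output,
)

# A batch of two empty voxel grids stands in for your own objects.
voxels = np.zeros((2, 16, 16, 16, 1), dtype="float32")
features = extractor.predict(voxels, verbose=0)
print(features.shape)
```

The extractor shares weights with the classifier, so whatever the classifier learned is what the feature vectors encode.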
I am trying to create a multiclass classifier to identify topics of Facebook posts from a group of parliament members.
I'm using SimpleTransformers to put together an XLM-RoBERTa-based classification model. Is there any way to add an embedding layer with metadata to improve the classifier? (For example, adding the political party to each Facebook post, together with the text itself.)
If you have a lot of training data, I would suggest adding the metadata to the input string (probably separated with [SEP] as another sentence) and just training the classifier. The model is certainly strong enough to learn how the metadata interact with the input sentence, given enough training examples (my guess is that tens of thousands might be enough).
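This first suggestion amounts to plain string construction before tokenization. A minimal sketch, assuming a hypothetical `party` field per post; the separator is a parameter because it depends on the tokenizer (`[SEP]` for BERT-style vocabularies, `</s>` for XLM-RoBERTa).

```python
def with_metadata(text, party, sep_token="</s>"):
    """Prefix the post text with its metadata, split by the separator token."""
    return f"{party} {sep_token} {text}"


# Made-up posts; the party names are placeholders.
posts = [
    {"party": "Party A", "text": "We need cleaner rivers."},
    {"party": "Party B", "text": "Wages must keep up with prices."},
]
inputs = [with_metadata(p["text"], p["party"]) for p in posts]
print(inputs[0])
```

With Hugging Face tokenizers, passing the two strings as a pair, as in `tokenizer(party, text)`, inserts the model's own separators and avoids hard-coding the token.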
If you do not have enough data, I would suggest running XLM-RoBERTa only to get the features, embedding your metadata independently, concatenating the two, and classifying with a multi-layer perceptron. This is probably not doable in SimpleTransformers, but it should be quite easy with Hugging Face's Transformers if you write the classification code directly in PyTorch.
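The second suggestion could be sketched in PyTorch as follows. The class name, the layer sizes, and the use of random tensors in place of real XLM-RoBERTa features are assumptions; 768 is the hidden size of the base model.

```python
import torch
from torch import nn


class TextWithMetadataClassifier(nn.Module):
    """MLP over sentence features concatenated with a party embedding."""

    def __init__(self, num_parties, num_topics, text_dim=768, meta_dim=16, hidden=128):
        super().__init__()
        self.party_embedding = nn.Embedding(num_parties, meta_dim)
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + meta_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_topics),
        )

    def forward(self, text_features, party_ids):
        meta = self.party_embedding(party_ids)
        combined = torch.cat([text_features, meta], dim=-1)
        return self.mlp(combined)


# Random tensors stand in for precomputed XLM-RoBERTa sentence features.
text_features = torch.randn(4, 768)
party_ids = torch.tensor([0, 2, 1, 2])

head = TextWithMetadataClassifier(num_parties=3, num_topics=5)
logits = head(text_features, party_ids)
print(logits.shape)
```

Because the text features are computed once and kept fixed, only the small embedding and MLP need training, which suits a limited dataset.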
Hello Guys, PLEASE I NEED YOUR ADVICE ON THIS.
I have a problem with the IBM Visual Recognition Service. I am creating a weed detection model with it, and I have carefully labelled and trained my images across the different classes.
The model performs well when I test it on unseen images that belong to these two classes (CORN and CHENOPODIUM ALBUM), as indicated below:
But my major problem is that when I test plants outside my labelled classes (Plantain and Cassava), the model identifies them as one of my labelled classes with a very high confidence.
What might be the reason for this, and how can I correct it?
So you have trained the model with images of CORN AND CHENOPODIUM ALBUM, and then are testing it with images of Plantain and Cassava, is that right?
The general "best practice" for training any machine learning model, whether a classifier or object detector, is to have your training data match the test data as nearly as possible. This is summarized as "You get what you train for."
It is not always possible, but to the extent that you have knowledge of what the test data will be like then you want to sample your training data from a similar distribution.
Think of yourself like a teacher preparing a student for a test. If you teach them Spanish and then the test is in Italian, the results will not be good.
In this case, to detect Plantain and Cassava, you will need to add Plantain and Cassava examples to your training set.
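One likely reason for the confident wrong answers, assuming the service behaves like a typical closed-set classifier with a softmax output (the question does not confirm this): the class probabilities always sum to 1, so an image of an unknown plant still has to land in one of the trained classes. The scores below are invented for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ["corn", "chenopodium album"]

# Made-up scores for a cassava image: neither class fits well,
# but one fits slightly less badly than the other.
cassava_logits = [-1.0, -4.0]
probs = softmax(cassava_logits)
for name, p in zip(CLASSES, probs):
    print(f"{name}: {p:.2f}")
```

Both raw scores are low, yet the normalized output still looks like a confident "corn" prediction, which is why the unknown plants need their own training examples.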
I am planning to do sentiment analysis on the customer reviews (a review can have multiple sentences) using word2vec. I have certain questions regarding this:
Should I train my word2vec model (in gensim) using just the training data? Should I consider the test data for this too?
How should I represent the review for classification? Will this representation take the order of the words into consideration, as this is important when representing a review for sentiment analysis?
Basically, the answer to your question is still a hot topic of research. Here is a research paper that might guide you.
It is the latest research work I know of in this area:
Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification
http://ir.hit.edu.cn/~dytang/paper/sswe/acl-slides.pdf
Code related to the paper:
https://github.com/attardi/deepnl/wiki/Sentiment-Specific-Word-Embeddings
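On the representation question, a common baseline (not taken from the paper above) is to average the word vectors of a review. The sketch below uses hand-written toy vectors instead of a trained gensim model, and the function name `average_vector` is made up. It shows that averaging discards word order.

```python
def average_vector(tokens, vectors):
    """Mean of the vectors of all known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 2-dimensional embeddings; real word2vec vectors would come from training.
toy_vectors = {
    "good": [1.0, 0.0],
    "not": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}

a = average_vector("good not bad".split(), toy_vectors)
b = average_vector("bad not good".split(), toy_vectors)
print(a == b)
```

The two reviews mean opposite things but get the same vector, so a model that needs word order has to consume the sequence of word vectors instead, for example with an RNN or LSTM.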
Hope this helps!