Emojis are regarded as unknown (UNK) in BERT - Python

My research interest is the effect of emojis in text. I am trying to classify sarcastic tweets. A month ago I used a dataset where I added the emoji tokens using:
tokenizer.add_tokens('List of Emojis')
When I tested it, the BERT model had successfully added the tokens. But two days ago, when I did the same thing for another dataset, the BERT model categorized them as 'UNK' tokens. My question is: has there been a recent change in the BERT model? I have tried it with the following tokenizer:
BertTokenizer.from_pretrained('bert-base-uncased')
The same happens with DistilBERT: it does not recognize the emojis despite my explicitly adding them. I had read somewhere that there is no need to add them to the tokenizer because BERT or DistilBERT already includes those emojis among its ~30,000 tokens, but I tried both with and without adding them, and in both cases the emojis are not recognized.
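For reference, this is roughly the check I am running (a minimal sketch; the sentence and emoji are just examples):
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('That was hilarious 😂'))
# the emoji comes back as '[UNK]' rather than as its own token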
What can I do to solve this issue? Your thoughts would be appreciated.

You might need to distinguish between a BERT model (the architecture) and a pre-trained BERT model. The former can definitely support emoji; the latter will only have reserved code points for them if they were in the data that was used to create the WordPiece tokenizer.
Here is an analysis of the 119,547-token WordPiece vocab used in the HuggingFace multilingual model. It does not mention emoji. Note that 119K is very large for a vocab; 8K, 16K or 32K is more usual. The vocab size has quite a big influence on the model size: the embedding and output layers of a Transformer (e.g. BERT) model carry far more weights than any single intermediate layer.
I've just been skimming how the paper Time to Take Emoji Seriously: They Vastly Improve Casual Conversational Models deals with it. They append 3,267 emoji to the end of the vocabulary and then train on data containing emoji, so the model can learn what to do with those new tokens.
BTW, a search of the HuggingFace GitHub repository found they are using from emoji import demojize. This suggests they convert emoji into text. Depending on what you are doing, you might need to disable that, or conversely you might need to be using it in your pipeline.
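As a rough illustration of the two routes above (a sketch; the emoji and model name are just examples, not the paper's code):
from emoji import demojize
print(demojize("That was hilarious 😂"))   # -> "That was hilarious :face_with_tears_of_joy:"

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["😂", "😭", "🔥"])          # append emoji to the end of the vocab
model.resize_token_embeddings(len(tokenizer))      # make room for the new embedding rows
# the new embeddings start out randomly initialised, so the model still needs fine-tuning
# on emoji-containing data before they carry any meaning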

Related

How can I implement this BERT model for sequential sentences classification using HuggingFace?

I want to classify the functions of sentences in the abstracts of scientific papers, and the function of a sentence is related to the functions of its surrounding sentences.
I found the model proposed in this paper very useful and straightforward: it simply feeds the BERT model multiple sentences, separated by multiple [SEP] tokens (see the figure in the paper).
I can train (fine-tune) this model using their code, but I would also like to build it with the transformers library (instead of allennlp) because that gives me more flexibility.
The most difficult part for me is how to extract the embeddings of all the [SEP] tokens from a sample (multiple sentences). I tried to read their code but found it quite difficult to follow. Could you help me with this procedure?
Thanks in advance!
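Edit: to make the question concrete, this is roughly what I am trying to do (a sketch with a placeholder model and sentences; I am not sure it is the right approach):
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = ["Background sentence.", "Method sentence.", "Result sentence."]
# join the sentences so the sample looks like: [CLS] s1 [SEP] s2 [SEP] s3 [SEP]
text = " [SEP] ".join(sentences)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state            # (1, seq_len, hidden_size)

sep_id = tokenizer.sep_token_id
sep_positions = (inputs["input_ids"][0] == sep_id).nonzero(as_tuple=True)[0]
sep_embeddings = hidden[0, sep_positions]                  # one vector per [SEP], i.e. per sentence
print(sep_embeddings.shape)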

Does fine-tuning a BERT model multiple times on different datasets make it more accurate?

I'm totally new to NLP and BERT models.
What I'm trying to do right now is sentiment analysis on Twitter trending hashtags ("neg", "neu", "pos") using a DistilBERT model, but the accuracy was about 50% (I tried with labelled data taken from Kaggle).
So here is my idea:
(1) First, I will fine-tune a DistilBERT model (Model 1) on the IMDB dataset.
(2) After that, since I have some data taken from Twitter posts, I will run sentiment analysis on them with Model 1 and get Result 2.
(3) Then I will fine-tune Model 1 again on Result 2, expecting to get Model 3.
I'm not really sure whether this process does anything to make the model more accurate.
Thanks for reading my post.
I'm a little skeptical about your first step. Since the IMDB dataset is different from your target data, I do not think it will positively affect the outcome of your work. I would therefore suggest fine-tuning on a dataset of tweets or other social media posts; however, if you are only focusing on hashtags and do not care about the text, it might work! My limited experience with fine-tuning transformers like BART and BERT is that the dataset you fine-tune on should be very similar to your actual data. In general, though, you can fine-tune a model on different datasets, and if the datasets are structured towards one goal, it can improve the model's accuracy.
If you want to fine-tune a sentiment classification head of BERT for classifying tweets, then I'd recommend a different strategy:
The IMDB dataset captures a different kind of sentiment - movie ratings do not really correspond to the sentiment of short posts, unless you want to focus on tweets about movies.
Using the classifier's output as input for further training of that same classifier is not really a good approach: if the classifier made many mistakes, these will be reflected in the training data, and the errors will deepen. This is basically creating endogenous labels, which will not really improve your real-world classification.
You should consider other ways of obtaining labelled training data. There are a few good examples for twitter:
Twitter datasets on Kaggle - there are plenty of datasets available containing millions of various tweets. Some of those even contain sentiment labels (usually inferred from emoticons, as these were proven to be more accurate than words in predicting sentiment - for explanation see e.g. Frasincar 2013). So that's probably where you should look.
Stocktwits (if you're interested in financial sentiment) - contains posts that authors can label for sentiment, so it is a good way of mining labelled data if stocks/cryptos are what you're looking for.
Another thing is picking a model better suited to your data; I'd recommend this one. It has been pretrained on 80M tweets, so it should provide strong improvements. I believe it even comes with a sentiment classification head that you can use.
Roberta Twitter Base
Check out the model page for guidance on loading it in your code - it's very easy; just use the following (this is for sentiment classification):
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
Another benefit of this model is that it has been pretrained from scratch with a vocabulary that contains emojis, meaning it has a deep understanding of them, their typical contexts and co-occurrences. This can greatly benefit social media classification, as many researchers would agree that emojis/emoticons are better predictors of sentiment than normal words.
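For example, a quick way to try it on an emoji-heavy tweet (a sketch using the transformers pipeline API; the example tweet and printed output are illustrative, and the label-to-sentiment mapping is described on the model card):
from transformers import pipeline

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
classify = pipeline("sentiment-analysis", model=MODEL, tokenizer=MODEL)
print(classify("best concert of my life 😍🔥"))
# e.g. [{'label': 'LABEL_2', 'score': 0.98}] - LABEL_0/1/2 correspond to negative/neutral/positive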

BERT Text Classification Tasks for Beginners

Can anyone list in simple terms tasks involved in building a BERT text classifier for someone new to CS working on their first project? Mine involves taking a list of paragraph length humanitarian aid activity descriptions (with corresponding titles and sector codes in the CSV file) and building a classifier able to assign sector codes to the descriptions, using a separate list of sector codes and their sentence long descriptions. For training, testing and evaluation, I'll compare the codes my classifier generates with those in the CSV file.
Any thoughts on the high-level tasks/steps involved, to help me make my project task checklist? I started a Google Colab notebook, made two CSV files, and put them in a Google Cloud Storage bucket; I guess I have to pull the files, tokenize the data and...? Ideally I'd like to stick with Google tools too.
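For the "pull the files" step, I'm thinking of something like this (a sketch; the bucket and file names are placeholders, and it assumes gcsfs is available, which it usually is in Colab):
import pandas as pd
from google.colab import auth

auth.authenticate_user()                       # give the notebook access to the bucket
activities = pd.read_csv("gs://your-bucket/activity_descriptions.csv")
sector_codes = pd.read_csv("gs://your-bucket/sector_codes.csv")
print(activities.head())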
As the comments say, I suggest you start with a blog or tutorial. The common way to use a BERT model in TensorFlow is through tensorflow_hub. There you have two modules: a BERT preprocessor and a BERT encoder. The preprocessor prepares your data (tokenization) and the encoder transforms it into a mathematical representation. If you are trying to use cosine similarity between two utterances, I have to say BERT is not made for that kind of task.
It is normal to use BERT as a step towards an objective, not an objective in itself. That is, build a model that uses BERT, but to begin with, use just BERT to understand how it works.
BERT preprocessor
Its output is a dict with multiple keys:
dict_keys(['input_mask', 'input_type_ids', 'input_word_ids'])
Respectively: input_mask marks which positions hold real tokens (as opposed to padding), input_type_ids gives the segment each token belongs to, and input_word_ids holds the vocabulary ids of the tokens.
BERT encoder
Its output is a dict with multiple keys:
dict_keys(['default', 'encoder_outputs', 'pooled_output', 'sequence_output'])
In order: default is the same as pooled_output, encoder_outputs holds the outputs of the encoder layers, pooled_output is a representation of the whole utterance, and sequence_output gives a contextual representation of each token inside the utterance.
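A minimal sketch of wiring the two modules together (assuming the standard tfhub.dev handles below; swap in whichever BERT variant you need):
import tensorflow as tf
import tensorflow_hub as hub

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

inputs = preprocessor(tf.constant(["clean water supply for displaced families"]))
print(inputs.keys())       # input_mask, input_type_ids, input_word_ids
outputs = encoder(inputs)
print(outputs.keys())      # default, encoder_outputs, pooled_output, sequence_output
pooled = outputs["pooled_output"]   # one vector per input text - feed this to your classifier head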
Take a look here (search for BERT).
Also see this question I asked.

Can I add a layer of meta data in a text classification model?

I am trying to create a multiclass classifier to identify topics of Facebook posts from a group of parliament members.
I'm using SimpleTransformers to put together an XLM-RoBERTa-based classification model. Is there any way to add an embedding layer with metadata to improve the classifier? (For example, adding the political party to each Facebook post, together with the text itself.)
If you have a lot of training data, I would suggest adding the metadata to the input string (probably separated with [SEP], as another sentence) and just training the classifier; a minimal sketch of this option is shown below. The model is certainly strong enough to learn how the metadata interacts with the input sentence, given enough training examples (my guess is tens of thousands might be enough).
If you do not have enough data, I would suggest running XLM-RoBERTa only to get the features, independently embedding your metadata, concatenating the features, and classifying with a multi-layer perceptron. This is probably not doable with SimpleTransformers, but it should be quite easy with Huggingface's Transformers if you write the classification code directly in PyTorch.
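A sketch of the first option (the model name, post text, and metadata string are just placeholders):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

post_text = "Today we voted on the new budget proposal..."   # the Facebook post
metadata = "party: Example Party"                             # whatever metadata you want to expose

# passing the metadata as the second element of a text pair makes the tokenizer
# insert the model's own separator token between the two segments
encoded = tokenizer(post_text, metadata, truncation=True, return_tensors="pt")
print(tokenizer.decode(encoded["input_ids"][0]))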

How does spacy use word embeddings for Named Entity Recognition (NER)?

I'm trying to train an NER model using spaCy to identify locations, (person) names, and organisations. I'm trying to understand how spaCy recognises entities in text and I've not been able to find an answer. From this issue on GitHub and this example, it appears that spaCy uses a number of features present in the text, such as POS tags, prefixes, suffixes, and other character- and word-based features, to train an Averaged Perceptron.
However, nowhere in the code does it appear that spaCy uses the GloVe embeddings (although each word in the sentence/document appears to have them, if present in the GloVe corpus).
My questions are -
Are these used in the NER system now?
If I were to switch out the word vectors to a different set, should I expect performance to change in a meaningful way?
Where in the code can I find out how (if at all) spaCy is using the word vectors?
I've tried looking through the Cython code, but I'm not able to understand whether the labelling system uses word embeddings.
spaCy does use word embeddings for its NER model, which is a multilayer CNN. There's quite a nice video about how its NER works by Matthew Honnibal, the creator of spaCy, here. All three English models use GloVe vectors trained on Common Crawl, but the smaller models "prune" the number of vectors by mapping similar words to the same vector (link).
It's quite doable to add custom vectors. There's an overview of the process in the spaCy docs, plus some example code on Github.
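If you do try custom vectors, the in-code route looks roughly like this (a sketch for spaCy v2.x-style APIs; the token and vector here are placeholders, and in practice you would load a whole vector table from file):
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")   # model whose vectors feed the statistical pipeline

# override (or add) the vector for a single token
custom_vector = np.random.rand(nlp.vocab.vectors_length).astype("float32")
nlp.vocab.set_vector("london", custom_vector)
print(nlp.vocab["london"].vector[:5])

# note: the pretrained NER weights were learned against the original vectors,
# so after swapping vectors you should retrain/fine-tune the NER component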
