Can anyone list, in simple terms, the tasks involved in building a BERT text classifier for someone new to CS working on their first project? Mine involves taking a list of paragraph-length humanitarian aid activity descriptions (with corresponding titles and sector codes in a CSV file) and building a classifier that can assign sector codes to the descriptions, using a separate list of sector codes and their sentence-long descriptions. For training, testing and evaluation, I'll compare the codes my classifier generates with those in the CSV file.
Any thoughts on the high-level tasks/steps involved, to help me make my project task checklist? I started a Google Colab notebook, made two CSV files, put them in a Google Cloud Storage bucket, and I guess I have to pull the files, tokenize the data, and... then what? Ideally I'd like to stick with Google tools too.
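For the "pull the files" step, here is a minimal sketch of reading the CSVs straight from the bucket into pandas inside Colab; the bucket and file names are placeholders, and you may need to install gcsfs first:

```python
# Sketch: load the two CSVs from a Google Cloud Storage bucket in Colab.
# Bucket/file names below are placeholders; adjust to your own.
from google.colab import auth
import pandas as pd

auth.authenticate_user()  # give the notebook access to your GCS project

activities = pd.read_csv("gs://your-bucket/activity_descriptions.csv")
sector_codes = pd.read_csv("gs://your-bucket/sector_codes.csv")

print(activities.head())
print(sector_codes.head())
```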
As the comments say, I suggest you start with a blog post or tutorial. The usual way to use a TensorFlow BERT model is through tensorflow_hub, which gives you two modules: a BERT preprocessor and a BERT encoder. The preprocessor prepares your data (tokenization), and the encoder transforms it into a numerical representation. If you are trying to compute cosine similarities between two utterances, I have to say that BERT is not made for that kind of task.
It is normal to use BERT as a step to reach an objective, not an objective itself. That is, build a model that uses BERT, but for the beginning, use just BERT to understand how it works.
BERT preprocess
Its output is a dict with multiple keys:
dict_keys(['input_mask', 'input_type_ids', 'input_word_ids'])
Respectively: 'input_mask' marks which positions hold real tokens versus padding, 'input_type_ids' marks which segment (sentence) each token belongs to, and 'input_word_ids' holds the vocabulary ids of the tokens.
BERT encoder
Its output is also a dict with multiple keys:
dict_keys(['default', 'encoder_outputs', 'pooled_output', 'sequence_output'])
In order, "same as pooled_output", "the output of the encoders", "the context of each utterance", "the context of each token inside the utterance".
Take a look here (search for bert)
Also see this question I asked
I want to classify the functions of sentences in the abstracts of scientific papers, and the function of a sentence is related to the functions of its surrounding sentences.
I found the model proposed in this paper very useful and straightforward: it simply feeds the BERT model multiple sentences, separated by multiple [SEP] tokens (the paper includes a figure of this architecture, not reproduced here).
I can train (fine-tune) this model using their code, but I would also like to build this model using the transformers library (instead of allennlp) because it gives me more flexibility.
The most difficult problem for me is how to extract the embeddings of all [SEP] tokens from a sample (multiple sentences). I tried to read their code but found it quite difficult for me to follow. Could you help me with this procedure?
Thanks in advance!
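For what it's worth, here is a minimal sketch of one way to pull out the [SEP] embeddings with the transformers library; this is not the paper's code, just an illustration using bert-base-uncased:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = ["First sentence.", "Second sentence.", "Third sentence."]
# Join the sentences so every sentence boundary gets its own [SEP] token
# (the tokenizer also prepends [CLS] and appends a final [SEP]).
text = " [SEP] ".join(sentences)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

sequence_output = outputs.last_hidden_state              # (batch, seq_len, hidden)
sep_mask = inputs["input_ids"] == tokenizer.sep_token_id  # positions holding [SEP]
sep_embeddings = sequence_output[sep_mask]                # (num_sep_tokens, hidden)
print(sep_embeddings.shape)
```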
My research interest is the effect of emojis in text. I am trying to classify sarcastic tweets. A month ago I used a dataset where I added the emoji tokens using:
tokenizer.add_tokens('List of Emojis').
When I tested it then, the BERT model had successfully added the tokens. But two days ago, when I did the same thing for another dataset, the BERT model categorized them as '[UNK]' tokens. My question is: has there been a recent change in the BERT model? I have tried it with the following tokenizer:
BertTokenizer.from_pretrained('bert-base-uncased')
The same happens with DistilBERT: it does not recognize the emojis despite my explicitly adding them. I read somewhere that there is no need to add them to the tokenizer because BERT and DistilBERT already include those emojis in their ~30,000-token vocabularies, but I tried both ways, with and without adding them, and in both cases the emojis are not recognized.
What can I do to solve this issue? Your thoughts would be appreciated.
You might need to distinguish between a BERT model (the architecture) and a pre-trained BERT model. The former can definitely support emoji; the latter will only have reserved code points for them if they were in the data that was used to create the WordPiece tokenizer.
Here is an analysis of the 119,547-entry WordPiece vocab used in the HuggingFace multilingual model. It does not mention emoji. Note that 119K is very large for a vocab; 8K, 16K or 32K is more typical. The vocab size has quite a big influence on the model size: the embedding and output layers of a Transformer (e.g. BERT) model carry far more weights than any of the intermediate layers.
I've just been skimming how the paper Time to Take Emoji Seriously: They Vastly Improve Casual Conversational Models deals with it. They append 3267 emoji to the end of the vocabulary, then train on data containing emoji so the model can learn what to do with those new tokens.
BTW, a search of the HuggingFace github repository found they are using from emoji import demojize. This sounds like they convert emoji into text. Depending on what you are doing, you might need to disable it, or conversely you might need to be using that in your pipeline.
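If you do add emoji yourself, the usual pattern (just a sketch, assuming bert-base-uncased and a made-up emoji list) is to add them to the tokenizer and then resize the model's embedding matrix so the new ids actually get weights:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical emoji list; use whatever actually appears in your tweets.
emojis = ["😂", "🙄", "🔥"]
num_added = tokenizer.add_tokens(emojis)
print(f"added {num_added} tokens")

# Without this step, the new token ids have no embedding rows, which is a
# common reason freshly added tokens seem to be ignored.
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("this is 🔥"))  # the emoji should no longer become [UNK]
```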
I'm working on a word2vec model in order to analyze a corpus of newspaper articles.
I have a CSV which contains, for each article, fields like the title, the journal, and the content of the article.
I know how to train my model to get the most similar words and their contexts.
However, I want to do sentiment analysis on it. I found some resources on how to do that, but in all the examples the test or train dataframe already has a sentiment column (0 or 1). Do you know if it's possible to automatically classify texts by sentiment, i.e. assign 0 or 1 to each text? I searched but couldn't find any references about that in the word2vec or doc2vec documentation...
Thanks in advance!
Both Word2Vec & Doc2Vec are just ways to turn words or lists-of-words into 'dense' vectors. Alone, they won't tell you sentiment.
When you have a text and want to deduce which categories it belongs to, that's called 'text classification'. Specifically, if you have just two categories (like 'positive-sentiment' vs 'negative-sentiment', or 'spam' vs 'not-spam'), that's called 'binary classification'.
The output of a Word2Vec or Doc2Vec model might be helpful in that task, but mainly as input to some other chosen 'classifier' algorithm. And such algorithms require some 'labeled examples' of each kind of text (where you supply the right answer) in order to work. So, you will likely have to go through your corpus of newspaper articles & mark a bunch of them with the answer you want.
You should start by working through some examples that use scikit-learn, the most popular Python library with text-classification tools, even without any Word2Vec or Doc2Vec features at first. For example, its docs include an intro:
"Working With Text Data"
Only after you've set up some basic code using generic preprocess/feature-extraction/training/evaluation steps, and reviewed some actual results, should you then consider if adding some features based on Word2Vec or Doc2Vec might help.
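To make that concrete, here is a minimal sketch with scikit-learn, assuming you have already hand-labeled a few articles (the texts and labels below are made-up placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled articles: 1 = positive sentiment, 0 = negative.
texts = [
    "great win for the local team",
    "terrible accident closes the highway",
    "festival brings joy to the city",
    "economy suffers another setback",
]
labels = [1, 0, 1, 0]

# TF-IDF features + logistic regression: a standard first baseline,
# before adding any Word2Vec/Doc2Vec-derived features.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["a wonderful day at the market"]))  # predicts 0 or 1
```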
I want to identify the category/business domain of the business a website belongs to.
For example, the Superhuman website: the company makes an email client powered by buzzword features and UI.
So in short, the category of that website could be "Professional email services".
To get this done, my initial thought was to apply the LDA algorithm (Python module) to the About Us text of a website and the company's Facebook info page, given that we have both. But this approach is not working in many cases. Any insights?
LDA details:
Using 20,000 passes and 1 topic, my result for the http://aakritiartgallery.com/ website is:
[(0, u'0.050*art + 0.020*aakriti + 0.019*contemporary + 0.017*gallery + 0.015*new')]
How can I narrow down to the business category from these term probabilities given by LDA?
@Anony-Mousse said it well: it would help to make a road map instead of fixating on a single algorithm. Given your situation, this is what I would do.
Preprocessing/Feature Extraction
NMF, LSA and LDA are unsupervised techniques mostly used in preprocessing to extract meaningful features. In NLP, this usually corresponds to extracting meaningful words from large amounts of text. Using these techniques, you can turn raw data into useful features. These algorithms by themselves do not make predictions, and on their own they are usually not enough to create a good model.
Training
In your case, you need structured data to train your model and make predictions. For instance, you can use the results of your LDA (in practice, the indices of those keywords) mapped to a business domain (your label).
e.g.:
(label)IT : (features) java, python, server
(label)Zoo: (features) monkey, zebra, giraffe
(label)IT : (features) nlp, machine learning
After you have gathered some data (at the very least #features * #labels examples), you can train a supervised model of your choice (logistic regression, SVM, NN, etc.).
Testing
Evaluate your prediction score, then put the algorithm to use.
Having said this, this would be no easy task. You would have to deal with identifying categories/subcategories, other means of extracting meaningful features, etc., so I would put a long time frame on this project. Good luck!
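For what it's worth, here is a rough sketch of that pipeline in scikit-learn, where LDA is only the feature-extraction step and a supervised classifier does the predicting (the texts and labels are made-up placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: raw text mapped to a business domain.
texts = [
    "java python server deployment cloud hosting",
    "monkey zebra giraffe enclosure feeding tickets",
    "nlp machine learning model training data",
    "lion tiger habitat conservation zoo visit",
]
labels = ["IT", "Zoo", "IT", "Zoo"]

# LDA turns word counts into topic proportions; the classifier maps
# those topic proportions to a label.
clf = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
clf.fit(texts, labels)

print(clf.predict(["kubernetes server python scripts"]))
```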
Get training data
Train a classifier
Classify!
What I am going to ask may sound very similar to the post "Sentiment analysis with NLTK python for sentences using sample data or webservice?", but I am already done with parsing and tokenizing the sentences from my text. My questions are:
1. Of the examples I have seen so far, the NLTK movie review example seems closest to my problem. But for movie_reviews the training text is already organized into two folders, pos and neg, with the texts stored there. Do I have to read my data manually and store the texts in two folders? Does that make the corpus? After that, can I work with them just like the movie_reviews data in the example?
2. If the answer to the above question is yes, is there any tool to speed up that task? For example, I want to work with only the texts which have "Monty Python" in their content, classify those manually, and then store them in the pos and neg folders. Does that work?
Please help me
Yes, you need a training corpus to train a classifier. Or you need some other way to detect sentiment.
To create a training corpus, you can classify by hand, you can have others classify it for you (mechanical turk is popular for this), or you can do corpus bootstrapping. For sentiment, that could involve creating 2 lists of keywords, positive words and negative words. Using those, you can create an initial training corpus, correct it by hand, then train a classifier. This is an iterative process, and the key thing to remember is "garbage in, garbage out". In other words, if your training corpus is wrong, you can't expect your classifier to be right.
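As a rough illustration of that bootstrapping idea (the keyword lists and example texts below are made-up placeholders, not a recommended lexicon):

```python
# Minimal sketch of keyword-based corpus bootstrapping with NLTK.
from nltk.classify import NaiveBayesClassifier

positive_words = {"great", "brilliant", "hilarious", "love"}
negative_words = {"awful", "boring", "terrible", "hate"}

def features(text):
    # Simple bag-of-words features, like the NLTK movie_reviews example.
    return {word: True for word in text.lower().split()}

def bootstrap_label(text):
    tokens = set(text.lower().split())
    score = len(tokens & positive_words) - len(tokens & negative_words)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return None  # ambiguous: set aside for hand labeling

texts = [
    "Monty Python is hilarious and brilliant",
    "that sketch was awful and boring",
]
labeled = [(features(t), bootstrap_label(t)) for t in texts if bootstrap_label(t)]

# Correct the bootstrapped labels by hand before trusting the classifier:
# garbage in, garbage out.
classifier = NaiveBayesClassifier.train(labeled)
print(classifier.classify(features("I love Monty Python")))
```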