I'm trying to find information on how to train a BERT model, possibly from the Huggingface Transformers library, so that the embeddings it outputs are more closely related to the context of the text I'm using.
However, all the examples I'm able to find are about fine-tuning the model for another task, such as classification.
Would anyone happen to have an example of fine-tuning BERT on masked tokens or next sentence prediction, which outputs another raw BERT model that is fine-tuned to the context?
Thanks!
Here is an example from the Transformers library on fine-tuning a language model for masked token prediction.
The model used is one of the masked-language-modeling family (e.g. BertForMaskedLM). The idea is to create a dataset using TextDataset, which tokenizes the text and breaks it into chunks. Then use a DataCollatorForLanguageModeling to randomly mask tokens in the chunks during training, and pass the model, the data and the collator to the Trainer to train and evaluate the results.
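A minimal sketch of that recipe, assuming a bert-base-uncased checkpoint, a plain-text file my_corpus.txt, and illustrative hyperparameters (adjust paths and settings to your own setup):

from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

# start from a pretrained checkpoint and its tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# TextDataset tokenizes the file and breaks it into fixed-size chunks
dataset = TextDataset(tokenizer=tokenizer, file_path="my_corpus.txt", block_size=128)

# the collator randomly masks tokens in each batch at training time
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./bert-mlm-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("./bert-mlm-finetuned")  # a raw BERT model adapted to your text

The saved directory can then be loaded back with from_pretrained like any other BERT checkpoint.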
Related
I am new to NLP and I am confused about embeddings.
If I already have trained GloVe or Word2Vec embeddings, is it possible to feed these into a Transformer? Or does the Transformer need raw text so that it can do its own embedding?
(Language: Python, Keras)
If you train a new transformer, you can do whatever you want with the bottom layer.
Most likely you are asking about pretrained transformers, though. Pretrained transformers such as BERT will have their own embeddings of the word pieces. In that case, you will probably get sufficient results just by using the output of the transformer.
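For example (a minimal sketch assuming the bert-base-uncased checkpoint and PyTorch), you can feed raw text through a pretrained BERT and take the hidden states as contextual word-piece embeddings:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one contextual vector per word piece: shape (1, sequence_length, hidden_size)
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)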
Per https://en.wikipedia.org/wiki/BERT_(language_model)
BERT models are pre-trained from unlabeled data extracted from the BooksCorpus with 800M words and English Wikipedia with 2,500M words.
Whether to train your model depends on your data.
For simple English text, the out-of-the-box model should work well.
If your data concentrates on a certain domain, e.g. job requisitions and job applications, then you can extend the model by training it on your corpus (aka transfer learning).
https://huggingface.co/docs/transformers/training
I want to train a spaCy custom NER model; which is the best option?
The training data is ready (doccano).
Option 1: use an existing pre-trained spaCy model and update it with the custom NER?
Option 2: create an empty model using spacy.blank() with the custom NER?
I just want to identify my custom entity in a text; the other entity types are not necessary, currently.
You want to leverage transfer learning as much as possible: this means you most likely want to use a pre-trained model (e.g. on Wikipedia data) and fine-tune it for your use case. This is because training a spacy.blank model from scratch will require lots of data, whereas fine-tuning a pretrained model might require as few as a couple hundred labels.
However, pay attention to catastrophic forgetting: when fine-tuning on your new labels, the model might 'forget' some of the old labels because they are no longer present in the training set.
For example, let's say you are trying to label the entity DOCTOR on top of a pre-trained NER model that labels LOC, PERSON and ORG. You label 200 DOCTOR records and fine-tune your model with them. You might find that the model now predicts every PERSON as a DOCTOR.
That's all one can say without knowing more about your data. Please check out the spaCy docs on training NER for more details.
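A minimal sketch of option 1 with the spaCy v3 API; the DOCTOR examples, texts and character offsets are invented for illustration, and a sentence with the old labels is mixed in to soften catastrophic forgetting:

import spacy
from spacy.training import Example

# start from a pretrained pipeline so its existing entity types are kept
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("DOCTOR")  # the new custom entity type

train_data = [
    ("Dr. Smith examined the patient.", {"entities": [(0, 9, "DOCTOR")]}),
    # keep some examples with the original labels in the mix
    ("Alice flew to Paris last week.", {"entities": [(0, 5, "PERSON"), (14, 19, "GPE")]}),
]

# only update the NER component, leaving the tagger/parser untouched
with nlp.select_pipes(enable="ner"):
    optimizer = nlp.resume_training()
    for _ in range(20):
        losses = {}
        for text, annotations in train_data:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)

doc = nlp("Dr. Jones saw Bob in London.")
print([(ent.text, ent.label_) for ent in doc.ents])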
I have my own corpus of plain text. I want to train a BERT model in TensorFlow, similar to gensim's word2vec, to get the embedding vectors for each word.
All the examples I have found are related to downstream NLP tasks like classification. But I want to train a BERT model on my custom corpus, after which I can get the embedding vector for a given word.
Any lead will be helpful.
If you have access to the required hardware, you can dig into NVIDIA's training scripts for BERT using TensorFlow. The repo is here. From the Medium article:
BERT-large can be pre-trained in 3.3 days on four DGX-2H nodes (a total of 64 Volta GPUs).
If you don't have an enormous corpus, you will probably have better results fine-tuning an available model. If you would like to do so, you can look into Hugging Face's Transformers.
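If the word2vec-like part is what you're after, one option (a sketch, assuming you have already fine-tuned and saved a model to a hypothetical ./bert-finetuned-on-my-corpus directory) is to pull the static token-embedding matrix out of the fine-tuned model:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./bert-finetuned-on-my-corpus")
model = AutoModel.from_pretrained("./bert-finetuned-on-my-corpus")

# the input embedding matrix holds one vector per word piece in the vocabulary
embedding_matrix = model.get_input_embeddings().weight  # (vocab_size, hidden_size)

token_id = tokenizer.convert_tokens_to_ids("apple")
word_vector = embedding_matrix[token_id].detach()
print(word_vector.shape)  # (hidden_size,)

Note that these are word-piece vectors: a word the tokenizer splits into several pieces has no single row, so people often average the piece vectors or use the contextual hidden states instead.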
I'm currently working on an NLP project. When I researched how to deal with NLP, I found some articles about spaCy. But because I'm still a newbie in Python, I don't understand how the spaCy TextCategorizer pipeline works.
Is there any detailed explanation of how this pipeline works? Does the TextCategorizer pipeline also use text feature extraction such as Bag of Words, TF-IDF, Word2Vec or anything else? And what model architecture is used in the spaCy TextCategorizer? Could someone explain this to me?
There's a lot of info in the docs:
https://spacy.io/usage/examples#textcat shows a code example
https://spacy.io/api/textcategorizer provides details on the architecture:
The model supports classification with multiple, non-mutually exclusive labels. You can change the model architecture rather easily, but by default, the TextCategorizer class uses a convolutional neural network to assign position-sensitive vectors to each word in the document. The TextCategorizer uses its own CNN model, to avoid sharing weights with the other pipeline components. The document tensor is then summarized by concatenating max and mean pooling, and a multilayer perceptron is used to predict an output vector of length nr_class, before a logistic activation is applied elementwise. The value of each output neuron is the probability that some class is present.
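To see it in code, here is a minimal sketch of training and using a text categorizer with the spaCy v3 API; the component name, labels and example texts are invented for illustration:

import spacy
from spacy.training import Example

# blank English pipeline with a multi-label text categorizer
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
textcat.add_label("SPORTS")
textcat.add_label("POLITICS")

train_data = [
    ("The match went to extra time", {"cats": {"SPORTS": 1.0, "POLITICS": 0.0}}),
    ("Parliament passed the new bill", {"cats": {"SPORTS": 0.0, "POLITICS": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), annotations) for text, annotations in train_data]

optimizer = nlp.initialize(lambda: examples)
for _ in range(20):
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

# doc.cats holds one independent score per label
doc = nlp("The striker scored twice")
print(doc.cats)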
I am using the Gensim library in Python for using and training a word2vec model. Recently, I was looking at initializing my model weights with a pre-trained word2vec model such as the GoogleNews pretrained model. I have been struggling with it for a couple of weeks. Now, I just found out that in gensim there is a function that can help me initialize the weights of my model with pre-trained model weights. It is mentioned below:
reset_from(other_model)
Borrow shareable pre-built structures (like vocab) from the other_model. Useful if testing multiple models in parallel on the same corpus.
I don't know whether this function can do the same thing or not. Please help!
You can now do incremental training with gensim. I would recommend loading the pretrained model and then doing an update.
from gensim.models import Word2Vec
# load the previously trained model
model = Word2Vec.load('pretrained_model.emb')
# add the vocabulary of the new sentences on top of the existing one
model.build_vocab(new_sentences, update=True)
# continue training; recent gensim versions require total_examples and epochs
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)