I want to add more words to the Portuguese spaCy model so that I can get the PoS (part of speech) tags for a specific domain. I don't want to add isolated words, but whole sentences. I followed these three steps:
I converted the "PetroTok-UDPIPE.conllu" file to a binary "PetroTok-UDPIPE.spacy" file. The .conllu file is freely available at http://petroles.ica.ele.puc-rio.br/, inside the "PetroTok" file, and contains sentences (not individual words) with their respective PoS tags and lemmas. I used the following command (indicated on the spaCy page: https://spacy.io/usage/training#data):
python -m spacy convert PetroTok-UDPIPE.conllu .
This created the "PetroTok-UDPIPE.spacy" file.
Then I created the "base_config.cfg" file (as indicated on the spaCy page: https://spacy.io/usage/training#quickstart), changing the values of "train" and "dev" to:
train = "PetroTok-UDPIPE.spacy"
dev = "PetroTok-UDPIPE.spacy"
(In this case I am considering the same data for train and validation, just for testing).
With that file in place, I used the following command to create the "config.cfg" file (also indicated on the spaCy page: https://spacy.io/usage/training#quickstart):
python -m spacy init fill-config base_config.cfg config.cfg
I then ran the following command to train the model (as indicated on the spaCy page: https://spacy.io/usage/training#quickstart):
python -m spacy train config.cfg --output ./output
That prints the following output:
...
When I test the model with a simple script that loads it from the "output" folder, it returns empty strings for ".lemma_" and ".pos_" of the string "INTRODUÇÃO.":
lemma = ['', '']
pos = ['', '']
Could you please help me identify the error? I have another question: is the model created this way trained only on the "PetroTok-UDPIPE.conllu" file, or does it build on the existing Portuguese model (in this case)?
Thank you.
Your model is probably setting the .tag_ attribute but not the .pos_ attribute.
In the official models, language-specific tags (.tag_) are learned by the model, and then an AttributeRuler maps them to Universal Dependencies tags (.pos_). The quickstart doesn't configure that mapping by default because there are different ways to do it, so you only get .tag_.
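As a rough illustration of that mapping step, here is a minimal sketch that adds an AttributeRuler to a blank Portuguese pipeline and maps one tag to a universal PoS. The tag names "N" and "PU" are hypothetical placeholders, not the actual PetroTok tagset, and the tags are set by hand here instead of coming from a trained tagger.

```python
import spacy
from spacy.tokens import Doc

# Blank pipeline: no trained components, so nothing needs to be downloaded.
nlp = spacy.blank("pt")
ruler = nlp.add_pipe("attribute_ruler")

# Hypothetical mapping: fine-grained tag "N" -> universal PoS "NOUN".
ruler.add(patterns=[[{"TAG": "N"}]], attrs={"POS": "NOUN"})

# Stand-in for tagger output: the tags are assigned manually (assumption).
doc = Doc(nlp.vocab, words=["INTRODUÇÃO", "."], tags=["N", "PU"])
doc = ruler(doc)

for token in doc:
    # The second token has no rule, so its .pos_ stays empty.
    print(token.text, repr(token.tag_), repr(token.pos_))
```

In a real pipeline the ruler would sit after the tagger, with one rule per tag in your tagset.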
I have another question, the model created in this way is created only with the "PetroTok.conllu" file or is it a model that incorporates elements to the model in Portuguese (in this case)?
The model will learn from scratch unless you tell it to do otherwise. Retraining a model without the other data is prone to catastrophic forgetting and not recommended, and training on two datasets for the same task with different tagsets sounds infeasible.
I'm training a custom named entity recognition model. I created the config.cfg and train.spacy files; among other settings, the config uses en_core_web_lg as the source of pre-trained vectors:
[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null
I then train the model using
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy
This works and I can see the output model.
Then I want to train another NER model that has nothing to do with the previous one (same code different data) and I get this error:
Error: [E884] The pipeline could not be initialized because the vectors could not be found at 'en_core_web_lg'.
If your pipeline was already initialized/trained before, call 'resume_training' instead of 'initialize', or initialize only the components that are new.
It looks like it modified the base en_core_web_lg model, which can be a problem for me since I use it for different models, some fine-tuned and others just out of the box.
How can I train this NER model while making sure the downloaded en_core_web_lg model is not modified? And would this ensure that I can train several models without them interfering with each other?
When you use a model as a source of vectors, or for that matter a source for any other part of a pipeline, spaCy will not modify it under any circumstances. Something else is going on.
Are you perhaps using a different virtualenv? Does spacy.load("en_core_web_lg") work?
One thing that could be happening (though it seems less likely): some config fields accept either the name of an installed pipeline (resolved through entry points) or a local path. If there is a directory named en_core_web_lg in the folder where you are training, it may be picked up first, so check for that.
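Both checks can be scripted. The sketch below uses only the standard library; the function name diagnose_pipeline_name is made up for illustration, and it only reports what exists, it does not reproduce spaCy's own lookup logic.

```python
import importlib.util
from pathlib import Path


def diagnose_pipeline_name(name, workdir="."):
    """Report how a bare name such as 'en_core_web_lg' could be resolved."""
    local = Path(workdir) / name
    return {
        # A file or directory with this name in the training folder.
        "local_path_exists": local.exists(),
        # Whether Python can import a package with this name in the
        # currently active environment (virtualenv).
        "importable_package": importlib.util.find_spec(name) is not None,
    }


print(diagnose_pipeline_name("en_core_web_lg"))
```

If importable_package is False, the package is probably installed in a different environment. If local_path_exists is True, a local directory may be shadowing the installed pipeline.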
I'm trying to compare the benefits of word vectors trained on in-domain data to word vectors trained on non-domain specific data on spaCy NER.
I have built two word2vec models, each one trained on different text.
Then, I try to build and evaluate two spaCy models, each with a different txt file containing the word2vec embeddings.
This is the code that I use to initiate the model:
!python -m spacy init vectors en models/w2v/merged/word2vec_merged_w2v.txt models/w2v/merged --name en_test
It runs successfully for both files, and each run creates a different number of vectors.
1st model:
Creating blank nlp object for language 'en'
[+] Successfully converted 28093 vectors
[+] Saved nlp object with vectors to output directory. You can now use the path
to it in your config as the 'vectors' setting in [initialize].
C:\Users\Ruben Weijers\models\w2v\merged
2nd model:
Creating blank nlp object for language 'en'
[+] Successfully converted 34712 vectors
[+] Saved nlp object with vectors to output directory. You can now use the path
to it in your config as the 'vectors' setting in [initialize].
C:\Users\Ruben Weijers\models\w2v\quine
The vectors are different. I then try to load and train using the following code:
nlp = spacy.load("models/w2v/merged")
nlp.add_pipe('ner')
nlp.to_disk("models/w2v/merged")
#train model
!python -m spacy train models/w2v/merged/config.cfg --output models/w2v/merged --paths.train data/train2.spacy --paths.dev data/test2.spacy --paths.vectors models/w2v/merged
I do the same for the other model, pointing to the other vector file of course.
However, the training pipeline shows that both models have exactly the same precision, recall and loss rate throughout the learning process.
Also, when I call:
#evaluate model
!python -m spacy evaluate models/w2v/merged/model-best data/val2.spacy --output models/w2v/merged/metrics.json
Both models have the same performance metrics, while the vectors are totally different.
I have looked up different videos on how to set the vectors path, and I have added the paths to the vectors in the config files, on top of adding
--paths.vectors models/w2v/merged
None of this seems to help. Many videos show how to implement word2vec, yet don't evaluate the result. I'm curious why both word2vec models appear to behave exactly the same; it doesn't make sense. I have checked multiple times that the paths are correct and that the files are in the right place. That doesn't seem to be the issue, since the different vector counts reported when the vectors were created also show that the vector files differ.
I have created the word vectors using:
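from gensim.models import Word2Vec

def train_w2v_model(model_name):
    # `sentences` is the tokenized corpus (a list of token lists), defined elsewhere
    w2v_model = Word2Vec(min_count=5,
                         window=2,
                         vector_size=500,
                         sample=6e-5,
                         alpha=0.03,
                         min_alpha=0.0007,
                         negative=20)
    # Build the vocabulary, then train for 30 epochs
    w2v_model.build_vocab(sentences)
    w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30)
    # Save the full model and a plain-text copy of the vectors for spaCy
    w2v_model.save(f"downloads/{model_name}.model")
    w2v_model.wv.save_word2vec_format(f"downloads/word2vec_{model_name}.txt")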
I know it's been a long time since this question was asked, but I ran into the same problem and found the solution, so I thought sharing it could be useful for somebody else.
Instead of
nlp.add_pipe('ner')
you should do this:
!python -m spacy init config -p ner -o accuracy config.cfg
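A possible explanation, which is my assumption and not something confirmed above: a pipeline built with nlp.add_pipe('ner') uses default settings where the embedding layer does not read static vectors, whereas a config generated with -o accuracy should set include_static_vectors = true. The sketch below checks a config for that setting. It uses the standard-library configparser as a stand-in for spaCy's own config loader, and the sample config text is invented for illustration.

```python
import configparser


def static_vector_settings(cfg_text):
    """Return {section: value} for every include_static_vectors entry."""
    # interpolation=None keeps spaCy's ${...} variables as plain text.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        section: parser[section]["include_static_vectors"]
        for section in parser.sections()
        if "include_static_vectors" in parser[section]
    }


# Invented sample: a vectors path is set, but the embedding layer ignores it.
sample = """
[paths]
vectors = "models/w2v/merged"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
include_static_vectors = false
"""

print(static_vector_settings(sample))
```

To inspect a real file, pass it the text of config.cfg, for example Path("config.cfg").read_text(). If every entry is false (or there are none), the two vector files cannot influence training, which would be consistent with the identical scores.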
Can anyone list in simple terms tasks involved in building a BERT text classifier for someone new to CS working on their first project? Mine involves taking a list of paragraph length humanitarian aid activity descriptions (with corresponding titles and sector codes in the CSV file) and building a classifier able to assign sector codes to the descriptions, using a separate list of sector codes and their sentence long descriptions. For training, testing and evaluation, I'll compare the codes my classifier generates with those in the CSV file.
Any thoughts on high level tasks/steps involved to help me make my project task checklist? I started a Google CoLab notebook, made two CSV files, put them in a Google cloud bucket and I guess I have to pull the files, tokenize the data and ? Ideally I'd like to stick with Google tools too.
As the comments say, I suggest starting with a blog post or tutorial. A common way to use a TensorFlow BERT model is through tensorflow_hub, which provides two modules: a BERT preprocessor and a BERT encoder. The preprocessor prepares your data (including tokenization), and the encoder transforms it into a numerical representation of the language. If you are trying to compute cosine similarity between two utterances, I have to say that BERT is not made for that kind of task.
It is normal to use BERT as a step towards an objective, not as an objective in itself. That is, build a model that uses BERT, but to begin with, use BERT on its own to understand how it works.
BERT preprocess
Its output is a dict with multiple keys:
dict_keys(['input_mask', 'input_type_ids', 'input_word_ids'])
Respectively, these mark where the real tokens are (as opposed to padding), which segment each token belongs to, and the vocabulary ID of each token.
BERT encoder
Its output is also a dict with multiple keys:
dict_keys(['default', 'encoder_outputs', 'pooled_output', 'sequence_output'])
In order: the same as pooled_output, the outputs of each encoder layer, a representation of the whole utterance, and a contextual representation of each token inside the utterance.
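To make the three preprocessor keys concrete, here is a toy sketch in plain Python. It is not the tensorflow_hub preprocessor. The function toy_bert_inputs is invented for illustration, and the special-token IDs (101 for [CLS], 102 for [SEP], 0 for padding) are assumed defaults that depend on the vocabulary in use.

```python
def toy_bert_inputs(token_ids, seq_length=8, cls_id=101, sep_id=102, pad_id=0):
    """Build the three input lists for a single-segment example."""
    # Wrap the tokens in [CLS] ... [SEP], truncating to leave room for both.
    ids = [cls_id] + list(token_ids)[: seq_length - 2] + [sep_id]
    n_pad = seq_length - len(ids)
    return {
        # Vocabulary ID of each position, padded to a fixed length.
        "input_word_ids": ids + [pad_id] * n_pad,
        # 1 for real tokens, 0 for padding.
        "input_mask": [1] * len(ids) + [0] * n_pad,
        # Segment ID of each position; all 0 for a single text.
        "input_type_ids": [0] * seq_length,
    }


# The two token IDs here are arbitrary example values.
print(toy_bert_inputs([7592, 2088], seq_length=6))
```

For a sentence pair, the positions of the second segment would carry input_type_ids of 1.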
Take a look here (search for bert)
Also watch this question I made
This question is for those who are familiar with GPT or GPT2 OpenAI models. In particular, with the encoding task (Byte-Pair Encoding). This is my problem:
I would like to know how I could create my own vocab.bpe file.
I have a Spanish text corpus that I would like to use to fit my own BPE encoder. I have succeeded in creating the encoder.json with the python-bpe library, but I have no idea how to obtain the vocab.bpe file.
I have reviewed the code in gpt-2/src/encoder.py but, I have not been able to find any hint. Any help or idea?
Thank you so much in advance.
Check out here; you can easily create the same vocab.bpe using the following command:
python learn_bpe -o ./vocab.bpe -i dataset.txt --symbols 50000
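As far as I understand, vocab.bpe is essentially an ordered list of merge rules, one pair of symbols per line. The sketch below shows how such a list can be learned. It is a simplified word-level version with an end-of-word marker "</w>", written for illustration; it is not the learn_bpe script itself, and GPT-2's own encoder works on bytes, not on characters.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn an ordered list of BPE merges from a list of words."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


for left, right in learn_bpe_merges(["abab", "abab", "ab"], 2):
    print(left, right)
```

Writing each returned pair as "left right" on its own line gives a file with the same general layout as a merges file.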
I haven't worked with GPT2, but bpemb is a very good place to start for subword embeddings. According to the README:
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.
I've used the pretrained embeddings for one of my projects along with sentencepiece and it turned out to be very useful.
I want to get the cosine similarity between sentences. I have tested doc2vec with gensim and trained it on only the few sentences given in the code. But I want to train my model using a text document that has one sentence per line. How can I use a document with sentences?
If your document is already in the form of a text file, with one-sentence-per-line, then many of the examples included with gensim (or elsewhere) show how to handle such a corpus.
For example, there's an introductory Doc2Vec tutorial notebook bundled with gensim in its docs/notebooks directory, which you can also view online at the project github repository:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
Its cell (3) shows, and cell (4) uses, a function to read a file line-by-line, and turn it into the TaggedDocument texts that the model requires.