I'm training a custom named entity recognition model. I created the config.cfg and train.spacy files and, among other settings, the config uses en_core_web_lg as the source of pre-trained vectors:
[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null
I then train the model using
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy
This works and I can see the output model.
Then I want to train another NER model that has nothing to do with the previous one (same code different data) and I get this error:
Error: [E884] The pipeline could not be initialized because the vectors could not be found at 'en_core_web_lg'.
If your pipeline was already initialized/trained before, call 'resume_training' instead of 'initialize', or initialize only the components that are new.
It looks like it modified the base en_core_web_lg model, which can be a problem for me since I use it for different models, some fine-tuned and others just out of the box.
How can I train this NER model while making sure the downloaded en_core_web_lg model is not modified? And would this ensure that I can train several models without them interfering with each other?
When you use a model as a source of vectors, or for that matter a source for any other part of a pipeline, spaCy will not modify it under any circumstances. Something else is going on.
Are you perhaps using a different virtualenv? Does spacy.load("en_core_web_lg") work?
One thing that could be happening (though it seems less likely) is a name clash: some fields accept either the name of an installed pipeline (resolved through entry points) or a local path. If you have a directory named en_core_web_lg in the folder where you are training, that directory could be checked first.
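To narrow down which of these cases applies, a small check along the following lines may help. It is only a sketch using the standard library: describe_vectors_source is a hypothetical helper name, and it merely reports whether a package with that name is importable in the current environment and whether a same-named directory exists in the working directory. It does not reproduce spaCy's own lookup logic.

```python
import importlib.util
from pathlib import Path


def describe_vectors_source(name, working_dir="."):
    """Report where a name such as 'en_core_web_lg' could resolve from."""
    return {
        # True if a package with this name is importable in this environment
        "installed_package": importlib.util.find_spec(name) is not None,
        # True if a same-named directory sits in the given working directory
        "local_directory": (Path(working_dir) / name).is_dir(),
    }


print(describe_vectors_source("en_core_web_lg"))
```

If installed_package is False, the active environment is probably not the one where the model was downloaded. If local_directory is True, renaming or moving that directory is worth trying.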
I'm trying to reload a DistilBertForSequenceClassification model I've fine-tuned and use that to predict some sentences into their appropriate labels (text classification).
In google Colab, after successfully training the BERT model, I downloaded it after saving:
trainer.train()
trainer.save_model("distilbert_classification")
The downloaded model has three files: config.json, pytorch_model.bin, training_args.bin.
I moved them into a folder named 'distilbert_classification' somewhere in my Google Drive.
Afterwards, I reloaded the model in a different Colab notebook:
reloadtrainer = DistilBertForSequenceClassification.from_pretrained('google drive directory/distilbert_classification')
Up to this point, I have succeeded without any errors.
However, how do I use this reloaded model (the 'reloadtrainer' object) to actually make predictions on sentences? What code do I need to use afterwards? I tried
reloadtrainer.predict("sample sentence") but it doesn't work. Would appreciate any help!
Remember that you also need to tokenize the input to your model, just like in the training phase. Merely feeding a raw sentence to the model will not work (unless you use pipeline(), but that's another discussion).
You may use an AutoModelForSequenceClassification() and AutoTokenizer() to make things easier.
Note that the way I am saving the model is via model.save_pretrained("path_to_model") rather than model.save().
One possible approach could be the following (say you trained with uncased distilbert):
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("path_to_model")
# Replace with whatever tokenizer you used
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)
input_text = "This is the text I am trying to classify."
tokenized_text = tokenizer(input_text,
                           truncation=True,
                           is_split_into_words=False,
                           return_tensors='pt')
# Pass input_ids and attention_mask together; no gradients are needed for inference
with torch.no_grad():
    outputs = model(**tokenized_text)
# Index of the highest-scoring class
predicted_label = outputs.logits.argmax(-1)
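The argmax in the last line gives a class index, not a label name or a confidence. As a rough sketch of that final step in plain Python, the helper below (logits_to_prediction is a hypothetical name, and the logits and label names are made up for illustration) applies a softmax and looks the index up in an id-to-label mapping. With a real model, the list could come from something like outputs.logits[0].tolist() and the mapping from model.config.id2label, assuming the labels were stored in the config at training time.

```python
import math


def logits_to_prediction(logits, id2label):
    """Turn raw logits for one input into (label, probability)."""
    # Softmax with the maximum subtracted for numerical stability
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Index of the most probable class
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits and label names, for illustration only
label, prob = logits_to_prediction([-1.2, 2.3], {0: "negative", 1: "positive"})
print(label, round(prob, 3))
```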
I'm trying to compare the benefits of word vectors trained on in-domain data to word vectors trained on non-domain specific data on spaCy NER.
I have built two word2vec models, each one trained on different text.
Then, I try to build and evaluate two spaCy models, each with a different txt file containing the word2vec embeddings.
This is the code that I use to initiate the model:
!python -m spacy init vectors en models/w2v/merged/word2vec_merged_w2v.txt models/w2v/merged --name en_test
It runs successfully, and the two runs create different numbers of vectors.
1st model:
Creating blank nlp object for language 'en'
[+] Successfully converted 28093 vectors
[+] Saved nlp object with vectors to output directory. You can now use the path
to it in your config as the 'vectors' setting in [initialize].
C:\Users\Ruben Weijers\models\w2v\merged
2nd model:
Creating blank nlp object for language 'en'
[+] Successfully converted 34712 vectors
[+] Saved nlp object with vectors to output directory. You can now use the path
to it in your config as the 'vectors' setting in [initialize].
C:\Users\Ruben Weijers\models\w2v\quine
The vectors are different. I then try to load and train using the following code:
nlp = spacy.load("models/w2v/merged")
nlp.add_pipe('ner')
nlp.to_disk("models/w2v/merged")
#train model
!python -m spacy train models/w2v/merged/config.cfg --output models/w2v/merged --paths.train data/train2.spacy --paths.dev data/test2.spacy --paths.vectors models/w2v/merged
I do the same for the other model, pointing to the other vector file of course.
However, the training pipeline shows that both models have exactly the same precision, recall and loss rate throughout the learning process.
Also, when I call:
#evaluate model
!python -m spacy evaluate models/w2v/merged/model-best data/val2.spacy --output models/w2v/merged/metrics.json
Both models have the same performance metrics, while the vectors are totally different.
I have looked up different videos on how to path vectors, I have added paths to the vectors to the config files, on top of adding
--paths.vectors models/w2v/merged
None of this seems to help. Many videos show how to implement word2vec, yet don't evaluate the result. I'm curious why both word2vec models appear to behave exactly the same; it doesn't make sense. I have checked multiple times that the paths are correct and that the files are in the right place. That doesn't seem to be the issue, since the different vector counts reported during creation also show that the vector files are different.
I have created the word vectors using:
from gensim.models import Word2Vec

def train_w2v_model(model_name):
    # `sentences` is the tokenized corpus, defined elsewhere
    w2v_model = Word2Vec(min_count=5,
                         window=2,
                         vector_size=500,
                         sample=6e-5,
                         alpha=0.03,
                         min_alpha=0.0007,
                         negative=20)
    w2v_model.build_vocab(sentences)
    w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30)
    w2v_model.save(f"downloads/{model_name}.model")
    # Export in word2vec text format, which `spacy init vectors` can read
    w2v_model.wv.save_word2vec_format(f"downloads/word2vec_{model_name}.txt")
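One quick way to confirm that two exported files really differ is to read the header line of each: the word2vec text format starts with a line holding the vector count and the dimensionality. The sketch below assumes that default header is present; read_w2v_header is a hypothetical helper, and the tiny sample file exists only to make the example runnable.

```python
import tempfile
from pathlib import Path


def read_w2v_header(path):
    """Return (vector_count, dimensions) from a word2vec text file."""
    with open(path, encoding="utf-8") as handle:
        count, dims = handle.readline().split()
    return int(count), int(dims)


# Tiny made-up file standing in for a real export
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "word2vec_sample.txt"
    sample.write_text("2 3\nfoo 0.1 0.2 0.3\nbar 0.4 0.5 0.6\n", encoding="utf-8")
    header = read_w2v_header(sample)

print(header)
```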
I know it's been a long time since this question was asked, but I ran into the same problem and found the solution, so I thought sharing it could be useful for somebody else.
Instead of
nlp.add_pipe('ner')
you should do this:
!python -m spacy init config -p ner -o accuracy config.cfg
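As far as I understand it, the difference is that a config generated with -o accuracy asks the embedding layer to read static vectors, whereas a bare nlp.add_pipe('ner') may leave the vectors unused, which would explain identical scores. A rough way to inspect a config for this is sketched below. It is an assumption-laden sketch: vector_settings is a hypothetical helper, the sample text is an illustrative fragment rather than a complete config, and Python's configparser is only used here as a simple reader for plain key = value sections.

```python
import configparser


def vector_settings(config_text):
    """Collect vector-related settings from a spaCy-style config string."""
    # interpolation=None keeps values such as ${paths.vectors} as raw text
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    found = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            if key in ("vectors", "include_static_vectors"):
                found[f"{section}.{key}"] = value
    return found


# Illustrative fragment, not a complete config
sample = """
[paths]
vectors = "models/w2v/merged"

[initialize]
vectors = ${paths.vectors}

[components.tok2vec.model.embed]
include_static_vectors = true
"""

settings = vector_settings(sample)
for name, value in settings.items():
    print(name, "=", value)
```

If no include_static_vectors entry is set to true, or the vectors entries are null, the trained model is probably not reading the word2vec table at all.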
I want to add more words to the spaCy model for Portuguese so that I can use the PoS (part of speech) tags of a specific domain, and I want to add sentences rather than isolated words. I did these three steps:
I converted the "PetroTok-UDPIPE.conllu" file (freely available here: http://petroles.ica.ele.puc-rio.br/, this is inside of the "PetroTok" file and contains sentences (not individual words) with their respective PoS and lemmas) to a binary "PetroTok-UDPIPE.spacy" file, with the following command (indicated on the SpaCy page: https://spacy.io/usage/training#data):
python -m spacy convert PetroTok-UDPIPE.conllu .
This created the "PetroTok-UDPIPE.spacy" file.
Then, I created the "base_config.cfg" file (as indicated in the SpaCy page: https://spacy.io/usage/training#quickstart):
changing the values of "train" and "dev" to:
train = "PetroTok-UDPIPE.spacy"
dev = "PetroTok-UDPIPE.spacy"
(In this case I am considering the same data for train and validation, just for testing).
Having that file I use the following command line to create the "config.cfg" file (also indicated in the SpaCy page: https://spacy.io/usage/training#quickstart):
python -m spacy init fill-config base_config.cfg config.cfg
I apply the following command to create the model (as indicated in the SpaCy page: https://spacy.io/usage/training#quickstart):
python -m spacy train config.cfg --output ./output
That prints the following output:
...
When I test a simple script that loads the created model from the "output" folder, it returns only empty strings for the ".lemma_" and ".pos_" of the string "INTRODUÇÃO.":
lemma = ['', '']
pos = ['', '']
Could you please help me to identify the implicit error? I have another question, the model created in this way is created only with the "PetroTok-UDPIPE.conllu" file or is it a model that incorporates elements to the model in Portuguese (in this case)?
Thank you.
Your model is probably setting the .tag_ attribute but not the .pos_ attribute.
In the official models, language-specific tags (.tag_) are learned by the model, and an AttributeRuler then maps them to Universal Dependencies tags (.pos_). The quickstart doesn't configure that mapping by default because there are different ways to do it, so you just get .tag_.
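One of those ways is to give an AttributeRuler a list of patterns that set POS for each TAG value. The sketch below only builds such a pattern list in plain Python. build_tag_to_pos_patterns is a hypothetical helper, and the tag names are placeholders that would need to be replaced with the tags your corpus actually uses.

```python
def build_tag_to_pos_patterns(tag_to_pos):
    """Build AttributeRuler-style pattern dicts from a tag -> POS mapping."""
    return [
        {"patterns": [[{"TAG": tag}]], "attrs": {"POS": pos}}
        for tag, pos in sorted(tag_to_pos.items())
    ]


# Hypothetical tag names, for illustration only
patterns = build_tag_to_pos_patterns({"N": "NOUN", "V": "VERB"})
print(patterns[0])

# With spaCy installed, the patterns could then be attached roughly like this:
#   ruler = nlp.add_pipe("attribute_ruler")
#   ruler.add_patterns(patterns)
```

The ruler has to run after the tagger in the pipeline so that .tag_ is already set when the patterns are matched.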
I have another question, the model created in this way is created only with the "PetroTok.conllu" file or is it a model that incorporates elements to the model in Portuguese (in this case)?
The model will learn from scratch unless you tell it to do otherwise. Retraining a model without the other data is prone to catastrophic forgetting and not recommended, and training on two datasets for the same task with different tagsets sounds infeasible.
I used a spacy blank model with Gensim custom word vectors. Then I trained the model to get the pipeline in the respective order-
entityruler1, ner1, entity ruler2, ner2
After training it, I saved it in a folder through
nlp.to_disk('path to folder')
However, if I try to load the same model using
nlp1 = spacy.load('path to folder')
It gives me this error-
ValueError: [E109] Model for component 'ner' not initialized. Did you forget to load a model, or forget to call begin_training()?
I cannot find any solution online. What might be the reason I am getting this? How do I successfully load and use my pretrained model?
Upgrading to spacy version 2.3.7 resolved this error. :)
Is it possible to have spaCy identify any arbitrary organization name, in any language, as an entity?
I tried spaCy's pretrained model to identify organization names, but it fails in some places, e.g. "Rama works at Remote software".
Yes, you can update an existing statistical model with more data, or train a new model from scratch. You can find more details and documentation here:
Training basics: https://spacy.io/usage/training#basics
Training the named entity recognizer: https://spacy.io/usage/training#ner
Training code examples: https://github.com/explosion/spaCy/tree/master/examples/training
spacy train command: https://spacy.io/api/cli#train
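Whichever route you take, the training data needs character offsets for each entity. As a minimal sketch of what one annotated example could look like, the helper below (make_ner_example is a hypothetical name) builds a text and annotation pair for the sentence from the question. It assumes the entity string occurs exactly once in the text.

```python
def make_ner_example(text, entity_text, label):
    """Build one (text, annotations) pair with character offsets."""
    start = text.find(entity_text)
    if start == -1:
        raise ValueError(f"{entity_text!r} not found in {text!r}")
    end = start + len(entity_text)
    return text, {"entities": [(start, end, label)]}


example = make_ner_example("Rama works at Remote software", "Remote software", "ORG")
print(example)
```

A handful of examples like this will not generalize to arbitrary organization names. The model needs many varied sentences, ideally mixed with examples of the entities it already recognizes.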