I'm trying to compare word vectors trained on in-domain data with word vectors trained on non-domain-specific data, measured by their effect on spaCy NER.
I have built two word2vec models, each one trained on different text.
Then, I try to build and evaluate two spaCy models, each with a different txt file containing the word2vec embeddings.
This is the code that I use to initiate the model:
!python -m spacy init vectors en models/w2v/merged/word2vec_merged_w2v.txt models/w2v/merged --name en_test
It runs successfully for both models, and each run converts a different number of vectors.
1st model:
Creating blank nlp object for language 'en'
[+] Successfully converted 28093 vectors
[+] Saved nlp object with vectors to output directory. You can now use the path
to it in your config as the 'vectors' setting in [initialize].
C:\Users\Ruben Weijers\models\w2v\merged
2nd model:
Creating blank nlp object for language 'en'
[+] Successfully converted 34712 vectors
[+] Saved nlp object with vectors to output directory. You can now use the path
to it in your config as the 'vectors' setting in [initialize].
C:\Users\Ruben Weijers\models\w2v\quine
The vectors are different. I then try to load and train using the following code:
nlp = spacy.load("models/w2v/merged")
nlp.add_pipe('ner')
nlp.to_disk("models/w2v/merged")
#train model
!python -m spacy train models/w2v/merged/config.cfg --output models/w2v/merged --paths.train data/train2.spacy --paths.dev data/test2.spacy --paths.vectors models/w2v/merged
I do the same for the other model, pointing to the other vector file of course.
However, the training pipeline shows that both models have exactly the same precision, recall and loss rate throughout the learning process.
Also, when I call:
#evaluate model
!python -m spacy evaluate models/w2v/merged/model-best data/val2.spacy --output models/w2v/merged/metrics.json
Both models have the same performance metrics, while the vectors are totally different.
I have watched several videos on how to set the vectors path, and I have added the vector paths to the config files, on top of adding
--paths.vectors models/w2v/merged
None of this seems to help. Many videos show how to implement word2vec, yet don't evaluate the result. I'm curious why both word2vec models appear to be exactly the same. It doesn't make sense. I have checked multiple times that the paths are correct and the files are in the right place. That doesn't seem to be the issue, since the different vector counts reported when the vectors were created also show that the vector files are different.
I have created the word vectors using:
from gensim.models import Word2Vec

def train_w2v_model(model_name):
    # `sentences` (the tokenized corpus) must be defined elsewhere
    w2v_model = Word2Vec(min_count=5,
                         window=2,
                         vector_size=500,
                         sample=6e-5,
                         alpha=0.03,
                         min_alpha=0.0007,
                         negative=20)
    w2v_model.build_vocab(sentences)
    w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30)
    # Save the full model and a plain-text copy of the vectors for spaCy
    w2v_model.save(f"downloads/{model_name}.model")
    w2v_model.wv.save_word2vec_format(f"downloads/word2vec_{model_name}.txt")
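The claim that the two vector files really differ can be checked directly, independent of spaCy. The following is a minimal sketch, assuming the files are in the plain-text word2vec format written by save_word2vec_format (a "vocab_size dimensions" header line, then one "word v1 v2 ..." line per word). The helper names read_w2v_text and compare_vectors are hypothetical, and the two tiny demo files stand in for the real ones.

```python
import os
import tempfile

def read_w2v_text(path):
    """Parse a word2vec text file into a {word: [floats]} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        header = fh.readline().split()
        dim = int(header[1])  # header is "<vocab_size> <dimensions>"
        for line in fh:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != dim:
                continue  # skip malformed lines
            vectors[word] = [float(v) for v in values]
    return vectors

def compare_vectors(a, b, tol=1e-6):
    """Count vocabulary overlap and how many shared words have equal vectors."""
    shared = set(a) & set(b)
    identical = sum(
        1 for w in shared
        if all(abs(x - y) <= tol for x, y in zip(a[w], b[w]))
    )
    return {
        "only_a": len(set(a) - shared),
        "only_b": len(set(b) - shared),
        "shared": len(shared),
        "identical": identical,
    }

# Demo with two tiny hypothetical files in place of the real vector files
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.txt")
    path_b = os.path.join(tmp, "b.txt")
    with open(path_a, "w", encoding="utf-8") as fh:
        fh.write("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
    with open(path_b, "w", encoding="utf-8") as fh:
        fh.write("2 3\ncat 0.1 0.2 0.3\nfish 0.7 0.8 0.9\n")
    report = compare_vectors(read_w2v_text(path_a), read_w2v_text(path_b))

print(report)
```

If the real files show many words that exist in only one file, or shared words with different values, the vectors themselves are not the reason the two trained models score identically.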
I know it's been a long time since this question was asked, but I ran into the same problem and found the solution, so I thought sharing it could be useful for somebody else.
Instead of
nlp.add_pipe('ner')
you should do this:
!python -m spacy init config -p ner -o accuracy config.cfg
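A plausible reason this helps: spaCy's train command takes everything from config.cfg, and the vectors only influence training if the embedding layer is configured to use them. As far as I know, configs generated with -o accuracy set include_static_vectors = true in the tok2vec embedding section, while a pipeline built with a bare nlp.add_pipe('ner') does not. The sketch below scans a config for that key using only the standard library. The function name static_vector_settings is hypothetical, and the sample text is an abbreviated, made-up fragment rather than a real generated config.

```python
import configparser

def static_vector_settings(config_text):
    """Return {section_name: bool} for every section that sets include_static_vectors."""
    # interpolation=None keeps spaCy-style ${paths.vectors} references as plain text
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(config_text)
    found = {}
    for section in parser.sections():
        if parser.has_option(section, "include_static_vectors"):
            value = parser.get(section, "include_static_vectors")
            found[section] = value.strip().lower() == "true"
    return found

# Abbreviated, hypothetical config fragment for illustration only
sample_config = """
[paths]
vectors = "models/w2v/merged"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
include_static_vectors = false
"""

settings = static_vector_settings(sample_config)
print(settings)
```

An empty result, or a value of False, would suggest that training ignores the vectors even though --paths.vectors points at the right directory.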
I'm training a custom named entity recognition model. I created the config.cfg and train.spacy files, and among other settings the config uses en_core_web_lg as the source of pre-trained vectors:
[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null
I then train the model using
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy
This works and I can see the output model.
Then I want to train another NER model that has nothing to do with the previous one (same code different data) and I get this error:
Error: [E884] The pipeline could not be initialized because the vectors could not be found at 'en_core_web_lg'.
If your pipeline was already initialized/trained before, call 'resume_training' instead of 'initialize', or initialize only the components that are new.
It looks like it modified the base en_core_web_lg model, which can be a problem for me since I use it for different models, some fine-tuned and others just out of the box.
How can I train this NER model while making sure the downloaded en_core_web_lg model is not modified? And would this ensure that I can train several models without them interfering with each other?
When you use a model as a source of vectors, or for that matter a source for any other part of a pipeline, spaCy will not modify it under any circumstances. Something else is going on.
Are you perhaps using a different virtualenv? Does spacy.load("en_core_web_lg") work?
One thing that could be happening (but seems less likely) is a name clash: in some fields you can use either the name of an installed pipeline (resolved using entry points) or a local path. If you have a directory named en_core_web_lg where you are training, that could be checked first.
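Both suggestions above can be checked quickly from the training environment. This is a minimal sketch using only the standard library; the function name diagnose_pipeline_name is hypothetical, and the demo uses a made-up name in a temporary directory so it runs anywhere. In practice you would call it with "en_core_web_lg" from the directory where you run spacy train.

```python
import importlib.util
import tempfile
from pathlib import Path

def diagnose_pipeline_name(name, cwd="."):
    """Report whether `name` is an importable package and/or a local path."""
    spec = importlib.util.find_spec(name)
    # A bare directory on sys.path can show up as a namespace package,
    # so only count specs that point at real module code.
    installed = spec is not None and spec.origin not in (None, "namespace")
    return {
        "local_path_exists": (Path(cwd) / name).exists(),
        "installed_package": installed,
    }

# Demo: a directory that shadows a (hypothetical) pipeline name
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hypothetical_pipeline_xyz").mkdir()
    report = diagnose_pipeline_name("hypothetical_pipeline_xyz", cwd=tmp)

print(report)
```

If local_path_exists is True for en_core_web_lg, renaming or moving that directory removes the ambiguity. If installed_package is False, the package is missing from the active environment.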
I'm trying to reload a DistilBertForSequenceClassification model I've fine-tuned and use that to predict some sentences into their appropriate labels (text classification).
In Google Colab, after successfully training the BERT model, I saved it and then downloaded it:
trainer.train()
trainer.save_model("distilbert_classification")
The downloaded model has three files: config.json, pytorch_model.bin, training_args.bin.
I moved them into a folder named 'distilbert_classification' somewhere in my Google Drive.
Afterwards, I reloaded the model in a different Colab notebook:
reloadtrainer = DistilBertForSequenceClassification.from_pretrained('google drive directory/distilbert_classification')
Up to this point, I have succeeded without any errors.
However, how do I use this reloaded model (the 'reloadtrainer' object) to actually make predictions on sentences? What code do I need to use afterwards? I tried
reloadtrainer.predict("sample sentence") but it doesn't work. Would appreciate any help!
Remember that you also need to tokenize the input to your model, just like in the training phase. Merely feeding a sentence to the model will not work (unless you use pipeline(), but that's another discussion).
You may use an AutoModelForSequenceClassification() and AutoTokenizer() to make things easier.
Note that the way I am saving the model is via model.save_pretrained("path_to_model") rather than model.save().
One possible approach could be the following (say you trained with uncased distilbert):
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("path_to_model")
# Replace with whatever tokenizer you used
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)
input_text = "This is the text I am trying to classify."
# Tokenize as in training and return PyTorch tensors
tokenized_text = tokenizer(input_text,
                           truncation=True,
                           is_split_into_words=False,
                           return_tensors='pt')
# Pass input_ids and attention_mask together
outputs = model(**tokenized_text)
# Index of the highest-scoring class
predicted_label = outputs.logits.argmax(-1)
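Note that predicted_label is a class index, not a label name. The sketch below shows, in plain Python, how a row of logits can be turned into a label and a probability. The id2label dictionary and the logit values are hypothetical; Hugging Face models expose a comparable mapping as model.config.id2label, and with real outputs you would pass outputs.logits[0].tolist().

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_prediction(logits, id2label):
    """Return (label_name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Hypothetical label mapping and logits for a two-class model
id2label = {0: "negative", 1: "positive"}
label, prob = decode_prediction([-1.2, 2.3], id2label)
print(label, round(prob, 3))
```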
I want to add more words to the spaCy model for Portuguese so that I can use PoS (part-of-speech) tagging for a specific domain. I don't want to add isolated words, but sentences. I did these three steps:
I converted the "PetroTok-UDPIPE.conllu" file (freely available here: http://petroles.ica.ele.puc-rio.br/, this is inside of the "PetroTok" file and contains sentences (not individual words) with their respective PoS and lemmas) to a binary "PetroTok-UDPIPE.spacy" file, with the following command (indicated on the SpaCy page: https://spacy.io/usage/training#data):
python -m spacy convert PetroTok-UDPIPE.conllu .
This created the "PetroTok-UDPIPE.spacy" file.
Then, I created the "base_config.cfg" file (as indicated in the SpaCy page: https://spacy.io/usage/training#quickstart):
changing the values of "train" and "dev" to:
train = "PetroTok-UDPIPE.spacy"
dev = "PetroTok-UDPIPE.spacy"
(In this case I am considering the same data for train and validation, just for testing).
Having that file I use the following command line to create the "config.cfg" file (also indicated in the SpaCy page: https://spacy.io/usage/training#quickstart):
python -m spacy init fill-config base_config.cfg config.cfg
I apply the following command to create the model (as indicated in the SpaCy page: https://spacy.io/usage/training#quickstart):
python -m spacy train config.cfg --output ./output
That prints the following output:
... When I test simple code that loads the created model from the "output" folder, it returns empty strings for the ".lemma_" and ".pos_" of the string "INTRODUÇÃO.":
lemma = ['', '']
pos = ['', '']
Could you please help me identify the error? I have another question: is the model created this way built only from the "PetroTok-UDPIPE.conllu" file, or does it also incorporate elements of the existing Portuguese model?
Thank you.
Your model is probably setting the .tag_ attribute but not the .pos_ attribute.
In the official models, what happens is that language-specific tags (.tag_) are learned by the model, and then an AttributeRuler maps them to Universal Dependencies tags (.pos_). The quickstart doesn't configure that by default because there are different ways to do it, so you just get .tag_.
I have another question, the model created in this way is created only with the "PetroTok.conllu" file or is it a model that incorporates elements to the model in Portuguese (in this case)?
The model will learn from scratch unless you tell it to do otherwise. Retraining a model without the other data is prone to catastrophic forgetting and not recommended, and training on two datasets for the same task with different tagsets sounds infeasible.
I'm working on a text similarity project using FastText. The basic example I have found for training a model is:
from gensim.models import FastText
model = FastText(tokens, size=100, window=3, min_count=1, iter=10, sorted_vocab=1)
As I understand it, since I'm specifying the vector and n-gram size, the model is being trained from scratch here, and if the dataset is small I would expect great results.
The other option I have found is to load the original Wikipedia model which is a huge file:
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('wiki.simple')
My question is, can I load the Wikipedia or any other model, and fine tune it with my dataset?
If you have a labelled dataset, then you should be able to fine-tune on it. This GitHub issue explains that you want to use the pretrainedVectors option. You would start with the Wikipedia pretrained vectors, then train on your dataset. It seems that gensim can do this, but according to this GH issue, there have been some bugs.
Does anyone know which function should I use if I want to use the pre-trained doc2vec models in this website https://github.com/jhlau/doc2vec?
I know we can use KeyedVectors.load_word2vec_format() to load the word vectors from pre-trained word2vec models, but is there a similar function in gensim to load pre-trained doc2vec models as well?
Thanks a lot.
When a model like Doc2Vec is saved with gensim's native save(), it can be reloaded with the native load() method:
model = Doc2Vec.load(filename)
Note that large internal arrays may have been saved alongside the main filename, in other filenames with extra extensions – and all those files must be kept together to re-load a fully-functional model. (You still need to specify only the main save file, and the auxiliary files will be discovered at expected names alongside it in the same directory.)
You may have other issues trying to use those pre-trained models. In particular:
as noted in the linked page, the author used a custom variant of gensim that forked off about 2 years ago; the files might not load in standard gensim, or later gensims
it's not completely clear what parameters were used to train those models (though I suppose if you succeed in loading them you could see them as properties in the model), and how much meta-optimization was used for which purposes, and whether those purposes will match your own project
if the parameters are as shown in one of the repo files, train_model.py, some are inconsistent with best practices (a min_count=1 is usually bad for Doc2Vec) or apparent model-size (a mere 1.4GB model couldn't hold 300-dimensional vectors for all of the millions of documents or word-tokens in 2015 Wikipedia)
I would highly recommend training your own model, on a corpus you understand, with recent code, and using metaparameters optimized for your own purposes.
Try this:
import gensim.models as g
model="model_folder/doc2vec.bin" #point to downloaded pre-trained doc2vec model
#load model
m = g.Doc2Vec.load(model)