spaCy: how to add patterns to an existing EntityRuler? - python

My spacy version is 2.3.7. I have an existing trained custom NER model with NER and Entity Ruler pipes.
I want to update and retrain this existing pipeline.
The code to create the entity ruler pipe was as follows:
from spacy.pipeline import EntityRuler

ruler = EntityRuler(nlp)
for pattern in patt_dict:          # each entry is a pattern dict
    ruler.add_patterns([pattern])  # add_patterns expects a list
nlp.add_pipe(ruler, name="entity_ruler")
Where patt_dict is the original list of patterns I had made.
Now that training is finished, I have more input data and want to train the model further with it.
How can I modify the above code to add more patterns to the entity ruler when I later load the spaCy model and want to retrain it with the additional input data?

It is generally better to retrain from scratch. If you train only on new data you are likely to run into "catastrophic forgetting", where the model forgets anything not in the new data.
This is covered in detail in this spaCy blog post. As of v3 the approach outlined there is available in spaCy, but it's still experimental and needs some work. In any case, it's still kind of a workaround, and the best thing is to train from scratch with all data.
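That said, for the narrower mechanical question of adding more patterns to the ruler of an already trained pipeline before retraining, something along these lines should work in spaCy 2.3 (a minimal sketch; the model path and the new_patterns list are placeholders, not taken from the question):

import spacy

# Load the previously trained pipeline (path is a placeholder).
nlp = spacy.load("path/to/trained_model")

# Fetch the existing EntityRuler by the name it was added under.
ruler = nlp.get_pipe("entity_ruler")

# Hypothetical new patterns, in the same format as the original patt_dict.
new_patterns = [{"label": "ORG", "pattern": "Acme Corp"}]
ruler.add_patterns(new_patterns)

# Persist the updated pipeline, then run your usual training over ALL the data.
nlp.to_disk("path/to/updated_model")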

I'd also second polm23's suggestion to retrain fully in this situation.
Here is why: we ask the model to produce inferences based on weights learned by repeatedly matching input data to labels/classes. Those weights are adjusted via backpropagation to reduce the error gradient with respect to the labels. When the weights, given whatever data, produce errors as close to 0 as possible, the loss eventually reaches an equilibrium, or you simply stop training via hyperparameters (epochs).
However, by training only on the new data, you will optimize for that specific data alone. The model will generalize poorly, but really only because it is learning exactly what you asked it to learn and nothing else. Add in that retraining fully is usually not the end of the world, and it just makes sense as a best practice.
(This is my imperfect understanding of the catastrophic forgetting issue; happy to learn more if others have deeper knowledge.)
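If you do go the combined-data route, the spaCy v2 training loop itself is the standard one; roughly (a sketch only - the model path, the two data lists, and the iteration count are placeholders, and calling nlp.begin_training() on a fresh pipeline would be the true from-scratch variant):

import random
import spacy
from spacy.util import minibatch

nlp = spacy.load("path/to/trained_model")  # placeholder path

# Old and new annotated examples in the spaCy v2 training format
# (both lists are placeholders for the real data).
TRAIN_DATA_OLD = [("Acme Corp hired Jane.", {"entities": [(0, 9, "ORG")]})]
TRAIN_DATA_NEW = [("Jane joined Globex.", {"entities": [(12, 18, "ORG")]})]
train_data = TRAIN_DATA_OLD + TRAIN_DATA_NEW

optimizer = nlp.resume_training()
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):  # update only the NER weights
    for itn in range(20):
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)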

Related

Update PyCaret Anomaly detection Model

I'm detecting anomalies in time series data using PyCaret. I take in the data at every call, run detection, and return the results. Everything works, but to improve performance I'm planning to load the saved model, retrain it with less data (say, daily, instead of some 1000 days of data at once), and save the model again. Performance improves a lot this way, as the model trains on much less data.
The problem is how to update/retrain the model. I couldn't find any method to update it.
Initially:
setup(dataframe)
model = create_model(model_name)
results = assign_model(model)
What I'm trying to do:
# try loading the model if it is already present
setup(data_frame_new)
if model_exists:
    retrain_model()  # <- no such method as far as I can tell
else:
    model = create_model(model_name)
save_model(model, model_name)
results = assign_model(model)
So, now I have a trained model and new data; how can I integrate the two?
Is there any way to retrain the model? I couldn't find any documentation on that so far, though I may have overlooked it. Please let me know how to achieve this.
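For reference, the fallback I'd settle for is simply retraining from scratch each day on a recent window of data, roughly like this (a sketch only; the 'iforest' model name, the helper function, and the save path are placeholders):

from pycaret.anomaly import setup, create_model, assign_model, save_model

def retrain_daily(recent_df, model_name="iforest", path="anomaly_model"):
    # Re-run setup on the most recent window of data only.
    setup(data=recent_df, session_id=42)
    # Train a fresh anomaly model instead of updating the old one.
    model = create_model(model_name)
    # Label/score the anomalies in the recent window.
    results = assign_model(model)
    # Overwrites the saved model; a timestamped name would make rollbacks easier.
    save_model(model, path)
    return results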

word-embedding: Convert supervised model into unsupervised model

I want to load a pre-trained embedding to initialize my own unsupervised FastText model and retrain it with my dataset.
The trained embedding file I have loads fine with gensim.models.KeyedVectors.load_word2vec_format('model.txt'). But when I try
FastText.load_fasttext_format('model.txt') I get: NotImplementedError: Supervised fastText models are not supported.
Is there any way to convert supervised KeyedVectors to unsupervised FastText? And if it is possible, is it a bad idea?
I know there is a great difference between supervised and unsupervised models, but I really want to try to use/convert this one and retrain it. I'm not finding a trained unsupervised model to load for my case (it's a Portuguese dataset), and the best model I could find is that one.
If your model.txt file loads OK with KeyedVectors.load_word2vec_format('model.txt'), then that's just a simple set of word-vectors. (That is, not a 'supervised' model.)
However, Gensim's FastText doesn't support preloading a simple set of vectors for further training - for continued training, it needs a full FastText model, either from Facebook's binary format, or a prior Gensim FastText model .save().
(The fact that trying to load a plain-vectors file generates that error suggests the load_fasttext_format() method is mistakenly interpreting it as some other kind of FastText binary model it doesn't support.)
Update after comment below:
Of course you can mutate a model however you like, including ways not officially supported by Gensim. Whether that's helpful is another matter.
You can create an FT model with a compatible/overlapping vocabulary, load the old word-vectors separately, then copy each prior vector over to replace the corresponding (randomly-initialized) vector in the new model. (Note that the property that affects further training is actually ftModel.wv.vectors_vocab, the trained-up full-word vectors, not .vectors, which is composited from full words & ngrams.)
But the tradeoffs of such an ad-hoc strategy are many. The ngrams would still start random. Taking some prior model's just-word vectors isn't quite the same as a FastText model's full-words-to-be-later-mixed-with-ngrams.
You'd want to make sure your new model's sense of word-frequencies is meaningful, as those affect further training - but that data isn't usually available with a plain-text prior word-vector set. (You could plausibly synthesize a good-enough set of frequencies by assuming a Zipf distribution.)
Your further training might get a "running start" from such initialization - but that wouldn't necessarily mean the end-vectors remain comparable to the starting ones. (All positions may be arbitrarily changed by the volume of newer training, progressively diluting away most of the prior influence.)
So: you'd be in an improvised/experimental setup, somewhat far from usual FastText practices and thus where you'd want to re-verify lots of assumptions, and rigorously evaluate if those extra steps/approximations are actually improving things.
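A rough sketch of that copy-over idea, using gensim 3.x attribute names (gensim 4.x renames several of them); the tiny corpus below is a placeholder:

from gensim.models import FastText, KeyedVectors

# Plain word-vectors, not a full FastText model.
kv = KeyedVectors.load_word2vec_format("model.txt")

# Placeholder corpus; use your real tokenized Portuguese sentences here.
my_sentences = [["isto", "é", "uma", "frase", "de", "exemplo"]]

# Fresh FastText model with a vocabulary built from your own corpus.
ft = FastText(size=kv.vector_size, min_count=1)
ft.build_vocab(sentences=my_sentences)

# Overwrite the randomly initialized full-word vectors with the prior ones.
# Note: vectors_vocab is what further training uses, not the composited .vectors.
for word, vocab_entry in ft.wv.vocab.items():   # gensim 3.x attributes
    if word in kv.vocab:
        ft.wv.vectors_vocab[vocab_entry.index] = kv[word]

# Continue training; the ngram vectors still start random.
ft.train(sentences=my_sentences, total_examples=ft.corpus_count, epochs=ft.epochs)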

Tensorflow retrain if it's wrong

I'm new to TensorFlow and AI, so I'm having trouble researching my question. Either that, or my question hasn't been answered.
I'm trying to make a text classifier to put websites into categories based on their keywords. I have at minimum 5,000 sites and at maximum 37,000 sites to train with.
What I'm trying to accomplish is: after the model is trained, I want it to continue to train as it makes predictions about the category a website belongs in.
The keywords the model is trained on are chosen by clients, so they can always differ from the rest of the websites in their category.
How can I make TensorFlow retrain its model based on corrections made by me when its prediction is inaccurate? Basically, I want it to keep training forever.
The key phrase you lack is fine-tuning. This is when you take a model that has finished its customary training (whatever that may be) but needs more work for the application you have in mind. You then give it additional training with new input; when that training has completed (training accuracy plateaus and is close to test accuracy), you deploy the enhanced model for your purposes.
This is often used in commercial applications -- for instance, when a large predictive model is updated to include the most recent week of customer activity. Another common use is to find a model in a zoo that is trained for something related to the application you want -- perhaps cats v dogs -- and use its recognition of facial features to shorten training for a model to identify two classes of cartoon characters -- perhaps Pokemon v Tiny Toons.
In this latter case, your fine-tuning will almost entirely eliminate what was learned by the last few layers of the model. What you gain is the early-layer abilities to find edges, regions, and features through eyes-nose-mouth combinations. This saves at least 30% of the overall training time.
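In Keras terms, fine-tuning on newly corrected examples might look roughly like this (a sketch; the file names, layer split, learning rate, and the new_x/new_y arrays are assumptions, not something from the question):

import tensorflow as tf

# Load the model that already finished its customary training (path assumed).
model = tf.keras.models.load_model("website_classifier.h5")

# Optionally freeze the earlier layers so only the later ones keep adjusting.
for layer in model.layers[:-2]:
    layer.trainable = False

# Recompile with a small learning rate so new data nudges rather than overwrites.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# new_x / new_y are hypothetical arrays of corrected examples collected since
# the last training run; in practice mix in a slice of the original data too.
model.fit(new_x, new_y, epochs=3)

model.save("website_classifier_finetuned.h5")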

Retraining and updating an existing Rasa NLU model

I've been using Rasa NLU for a project that involves making sense of structured text. My use case requires me to keep updating my training set by adding new examples of text corpus entities. This means I have to retrain my model every few days, which takes longer and longer as the training set grows.
Is there a way in Rasa NLU to update an already trained model by training it only on the new training data, instead of retraining the entire model on the full previous training set plus the new data?
I'm looking for an approach where I can simply update my existing trained model by training it with the incremental additional training data every few days.
To date, the most recent Github issue on the topic states there is no way to retrain a model adding just the new utterances.
Same for previous issues cited therein.
You're right: having to retrain periodically with increasingly long files gets more and more time-consuming. However, retraining in place is not a good idea in production anyway.
Excellent example in a user comment:
Retraining on the same model can be a problem for production systems. I used to overwrite my models, and then at some point one of the training runs didn't work perfectly and I started to see a critical drop in my responses' confidence. I had to find where the problem was coming from and retrain the model.
Training a new model each time (with a timestamp) is good because it makes rollbacks easier (and they will happen in production systems). I then fetch the up-to-date model names from the DB.
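For instance, a tiny helper for that timestamped-name practice (the prefix and storage details are placeholders):

from datetime import datetime

def timestamped_model_name(prefix="nlu_model"):
    # e.g. "nlu_model_20240131_154500": one artifact per training run,
    # so rolling back is just pointing the service at an older name.
    return "{}_{:%Y%m%d_%H%M%S}".format(prefix, datetime.utcnow())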

Named Entity Recognition in practice

I am an NLP novice trying to learn, and would like to better understand how Named Entity Recognition (NER) is implemented in practice, for example in popular Python libraries such as spaCy.
I understand the basic concept behind it, but I suspect I am missing some details.
From the documentation, it is not clear to me for example how much preprocessing is done on the text and annotation data; and what statistical model is used.
Do you know if:
In order to work, the text has to go through chunking before the model is trained, right? Otherwise it wouldn't be able to perform anything useful?
Are the text and annotations typically normalized prior to the training of the model? So that if a named entity is at the beginning or middle of a sentence it can still work?
Specifically in spaCy, how are things implemented concretely? Is it an HMM, a CRF, or something else that is used to build the model?
Apologies if this is all trivial, I am having some trouble finding easy to read documentation on NER implementations.
In https://spacy.io/models/en#en_core_web_md they say "English multi-task CNN trained on OntoNotes", so I imagine that's how they obtain the NEs. You can see that the pipeline is
tagger, parser, ner
and you can read more here: https://spacy.io/usage/processing-pipelines. I would try removing the different components and seeing what happens; this way you can see what depends on what (see the sketch below). I'm pretty sure NER depends on the tagger, but I'm not sure whether it requires the parser. All of them, of course, require the tokenizer.
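For example, in spaCy 2.x you can temporarily disable components and compare the entities you get back (a minimal sketch; the model name and example text are arbitrary):

import spacy

nlp = spacy.load("en_core_web_md")
text = "Apple is looking at buying a U.K. startup for $1 billion."

# Full pipeline: tagger, parser, ner.
print([(ent.text, ent.label_) for ent in nlp(text).ents])

# Temporarily disable components and see whether the entities change.
with nlp.disable_pipes("tagger", "parser"):
    print([(ent.text, ent.label_) for ent in nlp(text).ents])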
I don't understand your second point. If an entity is at the beginning or middle of a sentence, that's just fine; the NER system should be able to catch it. I don't see how the word "normalize" applies to the position of an entity in the text.
Regarding the model, they mention a multi-task CNN, so I guess the CNN is the model used for NER. Sometimes people use a CRF on top, but they don't mention one, so it is probably just the CNN. According to their performance figures, it's good enough.
