Retraining and updating an existing Rasa NLU model - python

I've been using Rasa NLU for a project that involves making sense of structured text. My use case requires me to keep updating my training set by adding new examples of text corpus entities. This means I have to retrain my model every few days, and each retraining takes longer as the training set grows.
Is there a way in Rasa NLU to update an already trained model by training it only on the new data, instead of retraining from scratch on the combined old and new training sets?
I'm looking for an approach where I can simply update my existing trained model with the incremental training data added every few days.

To date, the most recent GitHub issue on the topic states there is no way to retrain a model on just the new utterances. The same goes for the earlier issues cited therein.
You're right that retraining periodically on ever-longer files gets more and more time-consuming. That said, retraining in place is not a good idea in production anyway.
Excellent example in a user comment:
Retraining on the same model can be a problem for production systems. I used to overwrite my models, and at some point one of the training runs didn't work perfectly and I started to see a critical drop in my response confidences. I had to find where the problem was coming from and retrain the model.
Training a new model every time (with a timestamp) is good because it makes rollbacks easier (and they will happen in production systems). I then fetch the up-to-date model names from a DB.
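A minimal sketch of that pattern, assuming the legacy rasa_nlu 0.x Python API (file paths and the naming scheme are illustrative):
import time

from rasa_nlu import config
from rasa_nlu.model import Trainer
from rasa_nlu.training_data import load_data

# Train a fresh model from the full training file, as recommended above.
trainer = Trainer(config.load("config.yml"))
trainer.train(load_data("data/training_data.json"))

# Persist under a timestamped name instead of overwriting, so a bad
# training run can be rolled back by pointing the service at an older model.
model_name = "nlu_" + time.strftime("%Y%m%d-%H%M%S")
model_directory = trainer.persist("./models/", fixed_model_name=model_name)
# Record model_name (e.g. in a DB) and have the service fetch the newest one.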

Related

Update PyCaret Anomaly detection Model

I'm detecting anomalies in time series data using PyCaret. I take in the data on every call, run detection, and return the results. That works fine, but to improve performance I plan to load the saved model, retrain it on less data (say, one day's worth instead of fetching some 1,000 days of data at once), and save the model again. Performance improves a lot this way, since the model trains on much less data.
The problem is how to update/retrain the model. I couldn't find any method to do that.
Initially:
setup(dataframe)
model = create_model(modelName)
results = assign_model(model)
What I'm trying to do:
setup(data_frame_new)
# try loading the model if already present
if model_exists:
    model = load_model(modelName)
    retrain_model(...)  # <- this is the step I cannot find in PyCaret
else:
    model = create_model(modelName)
save_model(model, modelName)
results = assign_model(model)
So now I have a trained model and new data; how can I integrate the two?
Is there any way to retrain the model? I couldn't find any documentation on that so far, though I might have overlooked it. Please share your suggestions on how to achieve this.
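For reference, a minimal sketch of the cycle PyCaret's anomaly module (2.x) does support: persisting the fitted model and scoring new data with predict_model, rather than incremental retraining (the model type and file name are illustrative):
from pycaret.anomaly import (setup, create_model, assign_model,
                             save_model, load_model, predict_model)

# First run: fit on the historical data and persist the model.
setup(dataframe, session_id=42)
model = create_model("iforest")   # illustrative model choice
results = assign_model(model)
save_model(model, "anomaly_model")

# Later runs: reload the fitted model and score only the new data.
# predict_model applies the trained model; it does not update its weights.
model = load_model("anomaly_model")
new_results = predict_model(model, data=data_frame_new)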

Is the performance of a deep learning model affected if it has "seen" the same test images before?

I am working with the YOLOv3 model for an object detection task. I am using pre-trained weights that were generated for the COCO dataset, but I have my own data for the problem I am working on. As far as I know, using those trained weights as a starting point for my own model should not affect its performance once it is trained on an entirely different dataset (right?).
My question is: will the model give "honest" results if I train it multiple times and test it on the same test set each time, or would it have better performance since it has already been exposed to those test images during an earlier experiment? I've heard people say things like "the model has already seen that data", does that apply in my case?
For hyper-parameter selection, evaluation of different models, or evaluating during training, you should always use a validation set.
You are not allowed to use the test set until the end!
The whole purpose of test data is to estimate the performance after deployment. When you use it during training, whether to train or to evaluate your model, you expose that data. For example, based on the accuracy on the test set, you decide to increase the number of layers.
Now your performance on the test set will increase. However, it comes at a price!
Your estimate on the test set becomes biased, and you will no longer be able to use it to say anything about data your model sees after deployment.
For example, say you want to train an object detector for self-driving cars and you exposed the test set during training. Then you cannot use the accuracy on the test set to describe the detector's performance once you put it in a car and sell it to a customer.
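A minimal sketch of the discipline being described, using scikit-learn's train_test_split (the data and split ratios are illustrative):
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data; in practice these are your images and labels.
X = np.random.rand(1000, 32)
y = np.random.randint(0, 2, size=1000)

# Carve out the test set first; it is used exactly once, at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# Split the remainder into train and validation; all tuning decisions
# (number of layers, learning rate, ...) are judged on the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15, random_state=0)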
There is an old saying related to this matter:
If you torture the data enough, it will confess.

spacy how to add patterns to existing Entity ruler?

My spacy version is 2.3.7. I have an existing trained custom NER model with NER and Entity Ruler pipes.
I want to update and retrain this existing pipeline.
The code to create the entity ruler pipe was as follows:
from spacy.pipeline import EntityRuler

ruler = EntityRuler(nlp)
ruler.add_patterns(patt_dict)  # add_patterns takes the whole list of pattern dicts
nlp.add_pipe(ruler, name="entity_ruler")
Where patt_dict is the original list of patterns I had made.
Now that training is finished, I have more input data and want to train the model further on it.
How can I modify the above code to add more patterns to the entity ruler when I later load the spaCy model and want to retrain it with more input data?
It is generally better to retrain from scratch. If you train only on new data you are likely to run into "catastrophic forgetting", where the model forgets anything not in the new data.
This is covered in detail in this spaCy blog post. As of v3 the approach outlined there is available in spaCy, but it's still experimental and needs some work. In any case, it's still kind of a workaround, and the best thing is to train from scratch with all data.
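For the pattern half of the question specifically: the EntityRuler is rule-based, so appending patterns to a loaded pipeline involves no statistical retraining at all. A minimal sketch assuming spaCy 2.3 (the model path and pattern are illustrative):
import spacy

# Load the previously saved custom pipeline (path is illustrative).
nlp = spacy.load("my_custom_model")

# Fetch the ruler under the name it was registered with and append patterns.
ruler = nlp.get_pipe("entity_ruler")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "new gadget"}])

nlp.to_disk("my_custom_model_v2")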
I'd also second polm23's suggestion to retrain fully in this situation.
Here is why: we ask the model to produce inferences based on weights derived from matching input data to labels/classes/whatever, over and over. Those weights are adjusted via backprop to reduce the error gradient with respect to the labels. When, given the data, the weights produce errors as close to 0 as possible, the loss eventually reaches an equilibrium, or you simply stop via hyperparameters (epochs).
However, by training only on the new data you optimize for that specific data alone. The model will generalize poorly, precisely because it is learning exactly what you asked it to learn and nothing else. Given that retraining fully is usually not the end of the world, it just makes sense as a best practice.
(This is my imperfect understanding of the catastrophic forgetting issue; happy to learn more if others have deeper knowledge.)

Training same model with different training data in RASA NLU

I have created a chatbot, added training data (some hundreds of examples), and trained it; so far so good. But when I added more training data, some 50,000 examples or more, I got stuck: Rasa NLU cannot train on that much data. It can train on up to about 20,000 examples, but beyond that I get "ERROR: Cannot allocate memory".
You may have to prune your training set in order to leave room for the new examples.
You don't need to feed your model every possible combination of words; it is good at generalizing even from a sparse set of combinations.
Since you're working with 50k examples, I imagine you're already using a tool to generate them. Check its docs to see whether it can help with the pruning, or switch to the currently recommended one, which can.
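No pruning helper is named in the thread, but here is a rough sketch of capping the number of examples per intent, assuming the legacy Rasa NLU JSON training format (the cap and file names are illustrative):
import json
import random
from collections import defaultdict

MAX_PER_INTENT = 300  # illustrative cap

with open("training_data.json") as f:
    data = json.load(f)

# Group the examples by intent, then keep a random subset of each group.
by_intent = defaultdict(list)
for example in data["rasa_nlu_data"]["common_examples"]:
    by_intent[example["intent"]].append(example)

pruned = []
for intent, examples in by_intent.items():
    random.shuffle(examples)
    pruned.extend(examples[:MAX_PER_INTENT])

data["rasa_nlu_data"]["common_examples"] = pruned
with open("training_data_pruned.json", "w") as f:
    json.dump(data, f, indent=2)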

Keras/ Tensorflow - Deploying a model (categorization and normalization)

I've reached the point of deploying my trained model, built with Keras on top of TensorFlow. I have TensorFlow Serving running, and when I feed it the test data I get exactly the output expected. Wonderful!
However, in the real world (the deployed scenario) I need to pass the model new data it has never seen before. In the training/testing setup I did categorization and one-hot encoding, so I need to transform the submitted data set first. That I can probably manage.
I also did normalization (StandardScaler from sklearn), and here I have no clue what best practice is. To normalize the same way, I would seemingly need to run through the whole training data again plus the one submitted data set.
I believe this can be solved in an elegant way. Any ideas?
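One common pattern (a sketch, not from the thread): persist the scaler fitted on the training data alongside the model, then at serving time only call transform with the stored statistics; no pass over the training data is needed.
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in data; in practice X_train is your training matrix.
X_train = np.random.rand(500, 8)

# At training time: fit once on the training data, then persist the scaler.
scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, "scaler.joblib")

# At serving time: load the fitted scaler and apply transform only,
# reusing the stored training mean and variance on the incoming data.
scaler = joblib.load("scaler.joblib")
X_new = np.random.rand(3, 8)           # a newly submitted data set
X_new_scaled = scaler.transform(X_new)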
