Training same model with different training data in RASA NLU - python

I created a chatbot and added training data (a few hundred examples), and training went well. But when I added more training data, around 50,000 examples or more, I got stuck: RASA NLU cannot train on that much data. It can train on up to 20,000 examples, but no more than that. I'm getting "ERROR: Cannot allocate memory".

You may have to prune your training set to leave room for the new examples.
You don't need to feed your model every possible combination of words. It's good at generalizing even from a sparse set of combinations.
Since you're working with 50k examples, I imagine you're already using a tool to generate them. Check its docs to see whether it can help you with the pruning, or switch to a currently recommended tool that can.
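For illustration, here is a minimal pruning sketch, assuming the legacy Rasa NLU JSON training format ("rasa_nlu_data" -> "common_examples"); the file names and the per-intent cap are hypothetical and should be adapted to your setup:

```python
import json
import random
from collections import defaultdict

MAX_PER_INTENT = 500  # hypothetical cap; tune to what fits in memory

with open("training_data.json") as f:
    data = json.load(f)

# Group examples by intent so the sampling keeps every intent represented.
by_intent = defaultdict(list)
for example in data["rasa_nlu_data"]["common_examples"]:
    by_intent[example.get("intent", "unknown")].append(example)

pruned = []
for intent, examples in by_intent.items():
    random.shuffle(examples)
    pruned.extend(examples[:MAX_PER_INTENT])  # keep a random sample per intent

data["rasa_nlu_data"]["common_examples"] = pruned
with open("training_data_pruned.json", "w") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)
```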

Related

How many epochs are needed for 68,000 input frames with a CNN + LSTM architecture?

I'm trying to train my neural network. Its aim is to predict video quality. I am using a VGG16+LSTM network, but the VGG16 is not trainable. The total number of trainable parameters is 700,000.
I have three questions:
Are 700,000 trainable parameters enough for training on the 68,000 input frames?
Is it essential to train the VGG16?
How many epochs are needed to get the best results?
I haven't been into machine learning in a while, but my understanding is that:
It depends, but the only way to find out is to train it and look for over-/underfitting.
It depends on the network layout. It might also be useful to bypass some information around the VGG16, in case the VGG16 hides some of the information you actually need about 'video quality'.
It depends. You would have to split your data into a training set and a test set to find that out.
As with most things in machine learning, and especially deep learning, the answers aren't obvious and depend heavily on the problem and the exact network layout. There will be much trial and error involved.
The most important takeaway, I think, is to have two (or even three) different datasets for the training/validation/test steps, so you can answer those questions yourself.
For more information, read the Wikipedia entry on splitting your datasets.
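As a minimal sketch of that three-way split, using scikit-learn with placeholder arrays standing in for your frames and quality labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the 68,000 frames and their quality scores.
X = np.random.rand(68000, 128)   # e.g. per-frame features
y = np.random.rand(68000)        # e.g. quality labels

# First carve out a held-out test set, then split the rest into train/validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15, random_state=42)
```

Holding the test set out until the very end is what lets you answer the "how many epochs" question honestly.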
Start with one epoch and see what impact it has.
Even one epoch will take a long time, and evaluating the error also takes a while.
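If you'd rather not guess the epoch count at all, a common approach is early stopping on a validation set. A hedged Keras sketch, with a placeholder model and random data standing in for your actual VGG16+LSTM pipeline:

```python
import numpy as np
import tensorflow as tf

# Placeholder model; the random data is only there so the snippet runs end to end.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(128,))])
model.compile(optimizer="adam", loss="mse")

# Let the validation loss decide the epoch count instead of fixing it upfront.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

X, y = np.random.rand(1000, 128), np.random.rand(1000)
model.fit(X, y, validation_split=0.15, epochs=100,
          callbacks=[early_stop], verbose=0)
```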

Is there a way to combine two BERT models, trained from scratch on different corpora, into one model?

I want to train BERT from scratch in a different language, on almost 150GB of data (the corpus), and I need to finish this in at most one month.
In some articles, I read that training language models like BERT on data of this size took more than a month on 8 TPUs.
I have two different accounts (call them account-1 and account-2) on the Google Cloud TPU service, each of which offers me free TPUs for a month. I need to make use of both accounts to finish the training during the free month.
So, I plan to do:
1- Divide the corpus into roughly equal parts: corpus-1 and corpus-2.
2- Train two models from scratch (BERT-1, BERT-2), each on one part of the corpus (corpus-1, corpus-2), with each model training on a different account.
3- Save each model after training.
The question is:
1- Is there any way to combine these two models into one model?
(i.e., BERT-1 + BERT-2 = BERT-3, where BERT-3 behaves as if it had been trained on the entire 150GB corpus.) Is this possible?
Lastly, I will fine-tune the combined model (BERT-3) on different NLP tasks.
I appreciate any contribution. Thank you.
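For what it's worth, the most naive interpretation of "combining" two identically-shaped checkpoints is element-wise weight averaging, sketched below with hypothetical PyTorch checkpoint paths. Note there is no guarantee (and little reason to expect) that the averaged model behaves like one trained on the full 150GB corpus:

```python
import torch

# Naive illustration only: average two checkpoints with identical architectures.
# The paths are hypothetical; each file is assumed to hold a state_dict.
state_1 = torch.load("bert_1.pt")
state_2 = torch.load("bert_2.pt")

merged = {name: (state_1[name] + state_2[name]) / 2 for name in state_1}
torch.save(merged, "bert_3.pt")
```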

spaCy: train NER using multiprocessing

I am trying to train a custom NER model using spaCy. Currently, I have more than 2k records for training; each text consists of more than 100 words, with at least two entities per record. I'm running it for 50 iterations.
It is taking more than 2 hours to train completely.
Is there any way to train using multiprocessing? Would it improve the training time?
Short answer... probably not
It's very unlikely that you will be able to get this to work for a few reasons:
The network being trained is performing iterative optimization
Without knowing the results from the batch before, the next batch cannot be optimized
There is only a single network
Any parallel training would be creating divergent networks...
...which you would then somehow have to merge
Long answer... there's plenty you can do!
There are a few different things you can try however:
Get GPU training working if you haven't
It's a pain, but can speed up training time a bit
It will, however, dramatically lower CPU usage
Try to use spaCy command line tools
The JSON format is a pain to produce but...
The benefit is you get a well optimised algorithm written by the experts
It can have dramatically faster / better results than hand crafted methods
If you have different entities, you can train multiple specialised networks
Each of these may train faster
These networks could be done in parallel to each other (CPU permitting)
Optimise your Python and experiment with parameters
Speed and quality are very dependent on parameter tweaking (batch size, repetitions, etc.); see the sketch after this list
Make sure the Python code providing the batches is top notch
Pre-process your examples
spaCy NER extraction requires a surprisingly small amount of context to work
You could try pre-processing your snippets down to 10 or 15 surrounding words and see how your time and accuracy fare
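To make the batch-size point concrete, here is a minimal training-loop sketch in the spaCy 2.x style this answer assumes (spaCy 3 moved to a config-driven `spacy train` CLI); the toy data and iteration count are placeholders:

```python
import random
import spacy
from spacy.util import minibatch, compounding

# Toy training data in the spaCy 2.x format: (text, {"entities": [(start, end, label)]})
TRAIN_DATA = [
    ("Apple is looking at buying a U.K. startup", {"entities": [(0, 5, "ORG")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for itn in range(20):  # fewer fixed iterations; watch the losses instead
    random.shuffle(TRAIN_DATA)
    losses = {}
    # A compounding batch size (4 -> 32) is one of the parameter tweaks mentioned above.
    for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
    print(itn, losses)
```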
Final thoughts... when is your network "done"?
I have trained networks with many entities on thousands of examples for longer than you describe, and the long and the short of it is: sometimes it just takes time.
However 90% of the increase in performance is captured in the first 10% of training.
Do you need to wait for 50 batches?
... or are you looking for a specific level of performance?
If you monitor the quality every X batches, you can bail out when you hit a pre-defined level of quality.
You can also keep the old networks you have trained on previous batches and then "top them up" with new training, reaching a level of performance you couldn't get by starting from scratch in the same time.
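A tiny sketch of that "top up" pattern, again assuming the spaCy 2.x API and hypothetical model paths:

```python
import spacy

# Resume from an earlier model instead of starting from scratch.
nlp = spacy.load("models/ner_v1")
optimizer = nlp.resume_training()  # keeps the learned weights
# ...further nlp.update(...) calls with the new examples, as in the loop above...
nlp.to_disk("models/ner_v2")
```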
Good luck!
Hi, I did a similar project where I created a custom NER model using spaCy 3 and extracted 26 entities from a large dataset. It really depends on how you are passing your data. Follow the steps below; they might make it work on CPU (a conversion sketch follows the list):
Annotate your text files and save the annotations as JSON.
Convert your JSON files into the .spacy format, because this is the format spaCy accepts.
Now, the point to note is how you pass and serialize your .spacy data into spaCy Doc objects.
Passing all your JSON text at once will make training take longer, so split your data and pass it iteratively. Don't pass the consolidated data; split it.
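Here is a minimal conversion sketch for the JSON-to-.spacy step, assuming a hypothetical annotation layout of `[{"text": ..., "entities": [[start, end, label], ...]}]`; the spaCy 3 `DocBin` API is real, the file names are placeholders:

```python
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

with open("annotations.json") as f:
    records = json.load(f)

for record in records:
    doc = nlp.make_doc(record["text"])
    ents = []
    for start, end, label in record["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip spans that don't align with token boundaries
            ents.append(span)
    doc.ents = ents
    db.add(doc)

db.to_disk("train.spacy")  # the format the `spacy train` CLI consumes
```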

TensorFlow: retrain if it's wrong

I'm new to TensorFlow and AI, so I'm having trouble researching my question. Either that, or my question hasn't been answered.
I'm trying to make a text classifier to put websites into categories based on their keywords. I have a minimum of 5,000 sites and a maximum of 37,000 sites to train with.
What I'm trying to accomplish is: after the model is trained, I want it to continue to train as it makes predictions about the category a website belongs in.
The keywords that the model is trained on are chosen by clients, so they can always differ from those of the other websites in a category.
How can I make TensorFlow retrain its model based on corrections made by me when its prediction is inaccurate? Basically, to keep training forever.
The key phrase you're looking for is fine-tuning. This is when you take a model that has finished its customary training (whatever that may be) but needs more work for the application you have in mind. You then give it additional training with new input; when that training has completed (training accuracy plateaus and is close to test accuracy), you deploy the enhanced model for your purposes.
This is often used in commercial applications -- for instance, when a large predictive model is updated to include the most recent week of customer activity. Another common use is to find a model in a zoo that is trained for something related to the application you want -- perhaps cats v dogs -- and use its recognition of facial features to shorten training for a model to identify two classes of cartoon characters -- perhaps Pokemon v Tiny Toons.
In this latter case, your fine-tuning will almost entirely eliminate what was learned by the last few layers of the model. What you gain is the early-layer abilities to find edges, regions, and features through eyes-nose-mouth combinations. This saves at least 30% of the overall training time.
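As a hedged illustration of the fine-tuning loop described above, here is a toy Keras sketch; the model shape, data, and learning rates are all placeholders, and the pattern is simply "recompile at a lower learning rate, then fit on the corrected examples":

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for the already-trained keyword classifier
# (100-dim keyword vectors -> 5 site categories).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy")
model.fit(np.random.rand(1000, 100), np.random.randint(0, 5, 1000),
          epochs=3, verbose=0)  # stands in for the initial training

# Fine-tuning on corrections: whenever you fix a wrong prediction by hand,
# collect the corrected examples and train a few more epochs at a lower
# learning rate, repeating as corrections accumulate.
x_fix = np.random.rand(32, 100)        # hypothetical corrected inputs
y_fix = np.random.randint(0, 5, 32)    # the human-assigned labels
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy")
model.fit(x_fix, y_fix, epochs=2, verbose=0)
```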

Retraining and updating an existing Rasa NLU model

I've been using Rasa NLU for a project that involves making sense of structured text. My use case requires me to keep updating my training set by adding new examples of text corpus entities. However, this means I have to keep retraining my model every few days, which takes longer and longer as the training set grows.
Is there a way in Rasa NLU to update an already-trained model by training it only on the new data, instead of retraining from scratch on the entire previous training set plus the new one?
I'm looking for an approach where I can simply update my existing trained model by training it incrementally on the additional data every few days.
To date, the most recent GitHub issue on the topic states there is no way to retrain a model by adding just the new utterances.
The same goes for the previous issues cited therein.
You're right: having to retrain periodically on increasingly large files gets more and more time-consuming. That said, retraining in place is not a good idea in production anyway.
There's an excellent example in a user comment:
Retraining on the same model can be a problem for production systems. I used to overwrite my models, and then at some point one of the training runs didn't work perfectly and I started to see a critical drop in my response confidences. I had to find where the problem was coming from and retrain the model.
Training a new model every time (with a timestamp) is good because it makes rollbacks easier (and they will happen in production systems). I then fetch the up-to-date model names from the DB.
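A minimal, framework-agnostic sketch of that timestamped-model pattern (the paths and naming scheme are hypothetical):

```python
import datetime
import pathlib
import shutil

def persist_versioned(tmp_model_dir: str, models_root: str = "models") -> pathlib.Path:
    """Copy a freshly trained model into a timestamped directory instead of
    overwriting the previous one, so rollbacks stay cheap."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    target = pathlib.Path(models_root) / f"nlu-{stamp}"
    shutil.copytree(tmp_model_dir, target)
    return target  # record this path in your DB and point the server at it
```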
