I am trying to train a custom NER model using spaCy. Currently, I have more than 2k records for training; each text consists of more than 100 words and contains at least 2 entities. I am running it for 50 iterations.
It is taking more than 2 hours to train completely.
Is there any way to train using multiprocessing? Will it improve the training time?
Short answer... probably not
It's very unlikely that you will be able to get this to work for a few reasons:
The network being trained is performing iterative optimization
Without knowing the results from the batch before, the next batch cannot be optimized
There is only a single network
Any parallel training would be creating divergent networks...
...which you would then somehow have to merge
Long answer... there's plenty you can do!
There are, however, a few different things you can try:
Get GPU training working if you haven't
It's a pain, but can speed up training time a bit
It will, however, dramatically lower CPU usage
Try to use spaCy command line tools
The JSON format is a pain to produce, but...
The benefit is you get a well-optimised algorithm written by the experts
It can give dramatically faster / better results than hand-crafted methods
If you have different entities, you can train multiple specialised networks
Each of these may train faster
These networks could be trained in parallel to each other (CPU permitting)
Optimise your Python and experiment with parameters
Speed and quality are very dependent on parameter tweaking (batch size, repetitions, etc.)
Make sure the Python code providing the batches is top notch
Pre-process your examples
spaCy NER extraction requires a surprisingly small amount of context to work
You could try pre-processing your snippets to contain only 10 or 15 surrounding words and see how your time and accuracy fare (a trimming sketch follows this list)
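A minimal sketch of that trimming idea, assuming the common (text, {"entities": [(start, end, label), ...]}) training format and space-separated text; the window size is arbitrary:

def trim_example(text, entities, window=10):
    # Keep only `window` words either side of the span covering all entities.
    words = text.split(" ")
    # Map each character offset to the index of the word containing it.
    char_to_word, pos = {}, 0
    for i, w in enumerate(words):
        for c in range(pos, pos + len(w)):
            char_to_word[c] = i
        pos += len(w) + 1  # +1 for the separating space
    first = min(char_to_word[s] for s, e, label in entities)
    last = max(char_to_word[e - 1] for s, e, label in entities)
    lo, hi = max(0, first - window), min(len(words), last + window + 1)
    # Shift entity offsets so they point into the trimmed text.
    removed = sum(len(w) + 1 for w in words[:lo])
    new_entities = [(s - removed, e - removed, label) for s, e, label in entities]
    return " ".join(words[lo:hi]), new_entities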
Final thoughts... when is your network "done"?
I have trained networks with many entities on thousands of examples for longer than you describe, and the long and short of it is: sometimes it just takes time.
However, 90% of the increase in performance is captured in the first 10% of training.
Do you need to wait for all 50 iterations?
... or are you looking for a specific level of performance?
If you monitor the quality every X iterations, you can bail out when you hit a pre-defined level of quality (a skeleton of this is sketched below).
You can also keep old networks you have trained on previous batches and then "top them up" with new training to reach a level of performance you couldn't get by starting from scratch in the same time.
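A rough skeleton of that bail-out idea, written against a spaCy v2-style nlp.update training loop; evaluate_ner is a hypothetical stand-in for whatever dev-set scoring you already do, and the 0.85 target is arbitrary:

import random
import spacy
from spacy.util import minibatch

def train_with_early_stop(train_data, dev_data, max_iter=50, check_every=5, target_f=0.85):
    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    for _, ann in train_data:
        for _, _, label in ann["entities"]:
            ner.add_label(label)
    optimizer = nlp.begin_training()
    for i in range(max_iter):
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=32):
            texts, anns = zip(*batch)
            nlp.update(texts, anns, sgd=optimizer, drop=0.35, losses=losses)
        if (i + 1) % check_every == 0:
            f = evaluate_ner(nlp, dev_data)  # hypothetical: returns dev-set F-score
            print(f"iteration {i + 1}: ner loss {losses['ner']:.2f}, dev F {f:.2f}")
            if f >= target_f:
                break  # good enough -- no need to sit through all 50 iterations
    return nlp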
Good luck!
Hi, I did a similar project where I created a custom NER model using spaCy 3 and extracted 26 entities from a large dataset. It really depends on how you are passing your data. Follow the steps below; they might make this workable on CPU:
Annotate your text files and save the annotations as JSON.
Convert your JSON files into .spacy format, because this is the format spaCy accepts.
Now, the point to note is how you pass and serialize your .spacy data into spaCy Doc objects.
Passing all your JSON text at once will take more time in training, so split your data and pass it iteratively. Don't pass the consolidated data; split it (see the sketch below).
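A minimal sketch of steps 2-4, assuming each JSON record holds (text, {"entities": [[start, end, label], ...]}); the file names and the chunk split are illustrative only:

import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only; the actual training happens via `spacy train`

def json_to_spacy(json_path, out_path):
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    db = DocBin()
    for text, ann in records:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in ann["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:  # drop annotations that don't align to token boundaries
                spans.append(span)
        doc.ents = spans
        db.add(doc)
    db.to_disk(out_path)  # writes the .spacy file spaCy 3 trains from

# Split the data rather than converting one consolidated file:
# json_to_spacy("train_part_1.json", "train_part_1.spacy")
# json_to_spacy("train_part_2.json", "train_part_2.spacy")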
Somewhat of a beginner here on deep learning using Python & Stack Overflow.
I am currently working on something similar to a sentiment analysis of community posts using LSTM, and have been trying to add preprocessing steps to clean up the text data.
I have lots of ideas - say, 7 - for modifying/dropping certain data without sacrificing context that I think could improve my prediction accuracy, but I want to be able to see exactly how implementing one or some of these ideas can affect the prediction accuracy.
So is there a tool, statistical method, or technique I can use that will drastically cut down on the number of experiments (training the model + predicting on the test set) needed to see how "toggling on" one, two, or several of these preprocessing steps affects my prediction accuracy, instead of having to run something like 49 experiments and fill out the results in a 7x7 table? I have used the Taguchi method of design of experiments on a different kind of problem before, but I'm not sure it can be applied properly here, since the neural network will be trained in a completely different way depending on the data it is fed.
Thank you for any input and advice!
Python 3.9.6
I wrote the code to create word embeddings for my domain (medicine books). My data consists of 45,000 normal-length sentences (31,519 unique words, 591,347 words in total). When I create / train a model:
import multiprocessing
from gensim.models.word2vec import Word2Vec

model = Word2Vec(sentences,
                 min_count=5,
                 vector_size=200,
                 workers=multiprocessing.cpu_count(),
                 window=6)
model.save(full_path)
it is trained in about 1-2 seconds, and the size of the saved model is about 15 MB.
How can I check the correctness of my word embeddings?
There's not really 'correctness' for word-embeddings, just 'usefulness for desired purposes'.
Especially when you are training-up domain-specific vectors, for some specific intended task, you should try to create your own mix of evaluations.
Those might start as some ad hoc sanity checks, like: "if I look at model.most_similar('esophageal') (or many other probe words), do the results make sense to me?"
But it's even better if your evaluations are some repeatable quantitative scoring – such as a bunch of words that 'should' be closer-to-each-other than other words – that could be run, quickly, against a new model where you've tweaked parameters or added training data, to rank it against other models.
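For example, a tiny repeatable check might look like this; the probe pairs are invented placeholders you would replace with your own domain terms:

def probe_score(model, pairs):
    # Average cosine similarity over word pairs you expect to be related.
    sims = [model.wv.similarity(w1, w2) for w1, w2 in pairs
            if w1 in model.wv and w2 in model.wv]
    return sum(sims) / len(sims) if sims else 0.0

# pairs = [("esophageal", "gastric"), ("tumor", "carcinoma")]  # your own domain pairs
# print(probe_score(model, pairs))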
And best if your downstream application – info-retrieval, or classification, or recommendation, etc – itself has some robust scoring against ideal results, and you can apply that back to indicate whether one set of word-vectors is better than another for that use, and by how much.
Separately: that's not a very big training set for word2vec, but it might be enough for some useful results. Still, that might be a suspiciously fast completion. Try enabling logging at the INFO level and watch the progress output to make sure it makes sense at each step. Test whether using more than the default epochs=5 has the expected effect of making training take more time. (One common mistake is to pass a mere iterator, which can only produce the training data once, instead of a true iterable that can be re-iterated many times. That error allows the model the one pass it needs to discover the vocabulary, but not the epochs additional passes it needs for real training.)
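A sketch of that logging / epochs check, reusing the question's own variables (sentences is the same corpus as above); epochs=10 is only an example to confirm training time scales as expected:

import logging
import multiprocessing
from gensim.models.word2vec import Word2Vec

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)  # progress lines now show each vocabulary scan and epoch

model = Word2Vec(sentences,            # same corpus as in the question
                 min_count=5,
                 vector_size=200,
                 workers=multiprocessing.cpu_count(),
                 window=6,
                 epochs=10)            # more than the default 5 -- training should take noticeably longer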
I have created a chatbot, added training data (some hundreds of examples), and trained it; so far, so good. But when I added more training data, some 50,000 examples and even more, I got stuck: RASA NLU is unable to train that much data. It can train up to about 20,000 examples but no more than that. I'm getting "ERROR: Cannot allocate memory".
You may have to prune your training set in order to leave room for the new examples.
You don't need to feed your model with all the combinations of possible words. It's good at generalizing even learning a sparse set of combinations.
As you're working with 50k examples, I imagine you're already using a tool to generate them. You might check its docs to see if it can help you do the pruning, or switch to the currently recommended one, which can.
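If simple pruning is enough, a generic downsampling pass like this may do; the example structure (a dict with an "intent" key) and the per-intent cap are assumptions:

import random
from collections import defaultdict

def downsample(examples, max_per_intent=2000, seed=42):
    # Keep at most `max_per_intent` randomly chosen examples per intent.
    random.seed(seed)
    by_intent = defaultdict(list)
    for ex in examples:  # assumed: each example is a dict with an "intent" key
        by_intent[ex["intent"]].append(ex)
    pruned = []
    for exs in by_intent.values():
        random.shuffle(exs)
        pruned.extend(exs[:max_per_intent])
    return pruned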
In Tensorflow, it seems that preprocessing can be done either during training, when the batch is created from raw images (or data), or beforehand, when the images are already static. Given that, in theory, the preprocessing should take roughly equal time (if done on the same hardware), is there any practical disadvantage to doing data preprocessing (or even data augmentation) before training rather than during training in real time?
As a side question, could data augmentation even be done in Tensorflow if it was not done during training?
Is there any practical disadvantage in doing data preprocessing (or even data augmentation) before training than during training in real-time?
Yes, there are advantages (+++) and disadvantages (---):
Preprocessing before training:
--- preprocessed samples need to be stored: disk space consumption* (1)
--- only a "finite" amount of samples can be generated
+++ no runtime during training
---... but samples always need to be read from storage, i.e. storage (disk) I/O may become the bottleneck
--- not flexible: changing the dataset/augmentation requires generating a new augmented dataset
+++ for Tensorflow: easily work on numpy.ndarray or other data formats with any high-level image API (OpenCV, PIL, ...) to do augmentation, or even use any other language/tool you like
Preprocessing during training ("real-time"):
+++ infinite amount of samples can be generated (as it is generated on-the-fly)
+++ flexible: changing dataset/augmentation only requires changing code
+++ if dataset fits in memory, no disk I/O needed for data after reading once
--- adds runtime to your training* (2)
--- for Tensorflow: Building the preprocessing as part of the graph requires working with Tensors and restricts usage of APIs working on ndarrays or other formats.* (3)
Some specific aspects discussed in detail:
(1) Reproducing experiments "with the same data" is kind of straightforward with a dataset generated before training. However this can be solved (even more!) elegantly with storing a seed for real-time data generation.
(2): Training runtime for preprocessing: There are ways to avoid an expensive preprocessing pipeline to get in the way of your actual training. Tensorflow itself recommends filling Queues with many (CPU-)threads so that data generation can independently keep up with GPU data consumption. You can read more about this in the input pipeline performance guide.
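The Queue API referenced above belongs to TensorFlow 1.x; the same overlap-CPU-preprocessing-with-GPU-training idea in the current tf.data API looks roughly like this (file_paths, labels and the image size are placeholders for your own data):

import tensorflow as tf

# file_paths, labels: your own lists of image paths and class labels
def load_and_preprocess(path, label):
    image = tf.io.read_file(path)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label

dataset = (tf.data.Dataset.from_tensor_slices((file_paths, labels))
           .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # CPU threads do the work
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))  # keep batches ready while the GPU trains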
(3): Data augmentation in tensorflow
As a side question, could data augmentation even be done in Tensorflow if it was not done during (I think you mean before) training?
Yes, tensorflow offers some functionality to do augmentation. In terms of value augmentation of scalar/vector (or also more dimensional data), you can easily build something yourself with tf.multiply or other basic math ops. For image data, there are several ops implemented (see tf.image and tf.contrib.image), which should cover a lot of augmentation needs.
There are off-the-shelf preprocessing examples on github, one of which is used and described in the CNN tutorial (cifar10).
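For illustration, a few of the stock tf.image ops chained into one augmentation function (written for current TensorFlow; tf.contrib no longer exists in 2.x), which could be applied per sample, e.g. via a dataset map:

import tensorflow as tf

def augment(image):
    # Random, cheap per-image augmentations from tf.image.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return image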
Personally, I would always try to use real-time preprocessing, as generating (potentially huge) datasets feels clunky. But it is perfectly viable; I've seen it done many times and (as you see above) it definitely has its advantages.
I have been wondering the same thing and have been disappointed with my during-training-time image processing performance. It has taken me a while to appreciate quite how big an overhead the image manipulation can be.
I am going to make myself a nice fat juicy preprocessed/augmented data file. Run it overnight and then come in the next day and be twice as productive!
I am using a single-GPU machine and it seems obvious to me that piece-by-piece model building is the way to go. However, the workflow maths may look different if you have different hardware. For example, on my MacBook Pro, tensorflow was slow (on CPU) and image processing was blindingly fast because it was automatically done on the laptop's GPU. Now that I have moved to a proper GPU machine, tensorflow is running 20x faster and the image processing is the bottleneck.
Just work out how long your augmentation/preprocessing is going to take, work out how often you are going to reuse it and then do the maths.
I am new to NLTK and machine learning. I'm using Python with the NLTK Naive Bayes Classifier. I have created a Naive Bayes classifier for text classification using NLTK and saved it to disk. I am also able to load it when needed to classify some test data using this Python code:
import pickle
f = open('classifier.pickle', 'rb')  # open in binary mode; pickle files are binary
classifier = pickle.load(f)
f.close()
But my problem is that whenever new test data comes in, I have to load this classifier into memory again and again, which takes a lot of time (2-3 min) because of its large size. Also, if I have to run two instances of the same sentiment analysis program, that will take double the RAM, as both programs will load this classifier separately. My question is: is there any technique to keep this classifier in memory so that the sentiment analysis programs can read it directly from memory whenever needed, or is there any other method by which the load time of the classifier can be minimized?
Thanks in advance for your help.
You can't have it both ways. You can either keep pickling/unpickling one at a time to use less RAM, or you can store both in memory, using twice as much RAM but reducing load times and disk I/O wait times.
Are the two classifiers trained using different training data, or are you using the same classifier in parallel? It sounds like the latter from your use of "two instances", and in that case you may want to look into threading so the same classifier can work with two sets of data (some parallelism may be achieved by classifying some of the data, then doing other work such as results processing while the other thread classifies, and repeating); see the sketch below.
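A rough sketch of that threading idea, loading the pickle once in a single long-running process; extract_features, docs_a and docs_b are hypothetical placeholders for your own feature function and data:

import pickle
from concurrent.futures import ThreadPoolExecutor

# Load the classifier once and share it between threads.
with open("classifier.pickle", "rb") as f:
    classifier = pickle.load(f)

def classify_batch(documents):
    # extract_features is whatever feature extractor the classifier was trained with (hypothetical).
    return [classifier.classify(extract_features(doc)) for doc in documents]

# docs_a, docs_b: two lists of documents to classify (placeholders).
with ThreadPoolExecutor(max_workers=2) as pool:
    results_a, results_b = pool.map(classify_batch, [docs_a, docs_b])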
My expertise in this comes from having started an open source NLTK based sentiment analysis system: https://bitbucket.org/tommyjcarpenter/evopminer.