Python 3.9.6
I wrote the code to create word embeddings for my domain (medicine books). My data consists of 45,000 normal-length sentences (31,519 unique words, 591,347 words in total). When I create/train a model:
import multiprocessing
from gensim.models.word2vec import Word2Vec

model = Word2Vec(sentences,
                 min_count=5,
                 vector_size=200,
                 workers=multiprocessing.cpu_count(),
                 window=6)
model.save(full_path)
it trains for only about 1-2 seconds, and the saved model is about 15 MB.
How can I check the correctness of the word embeddings I've created?
There's not really 'correctness' for word-embeddings, just 'usefulness for desired purposes'.
Especially when you are training-up domain-specific vectors, for some specific intended task, you should try to create your own mix of evaluations.
Those might start as some ad hoc sanity checks, like: "if I look at model.most_similar('esophageal') (or many other probe words), do the results make sense to me?"
But it's even better if your evaluations are some repeatable quantitative scoring – such as a bunch of words that 'should' be closer-to-each-other than other words – that could be run, quickly, against a new model where you've tweaked parameters or added training data, to rank it against other models.
And best if your downstream application – info-retrieval, or classification, or recommendation, etc – itself has some robust scoring against ideal results, and you can apply that back to indicate whether one set of word-vectors is better than another for that use, and by how much.
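For example, a minimal sketch of such a repeatable check might look like the following; the model path and the probe word pairs are placeholders you'd replace with your own saved model and with domain terms that should (and shouldn't) score as similar:

from gensim.models.word2vec import Word2Vec

model = Word2Vec.load("med_w2v.model")          # hypothetical saved model

related_pairs = [("esophageal", "gastric"), ("aspirin", "ibuprofen")]    # should be close
unrelated_pairs = [("esophageal", "ibuprofen"), ("aspirin", "gastric")]  # should be further apart

def mean_similarity(pairs):
    # average cosine similarity over the pairs whose words are in the vocabulary
    sims = [model.wv.similarity(a, b) for a, b in pairs if a in model.wv and b in model.wv]
    return sum(sims) / len(sims) if sims else float("nan")

related, unrelated = mean_similarity(related_pairs), mean_similarity(unrelated_pairs)
print(f"related: {related:.3f}  unrelated: {unrelated:.3f}  margin: {related - unrelated:.3f}")

A model that scores a larger margin on your own probe list is, by that measure, the better one – and the same script can be re-run after every parameter tweak.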
Separately: that's not a very big training set for word2vec, but it might be enough for some useful results. Still, completing in 1-2 seconds is suspiciously fast. Try enabling logging at the INFO level, and watch the progress info to make sure it's making sense at each step. Test whether using more than the default epochs=5 has the expected effect of making training take more time. (One common mistake is to pass a mere iterator, that's only capable of producing the training data once, instead of a true iterable that can be re-iterated many times, to the model. That error allows the model the one pass it needs to discover the vocabulary but not the epochs additional passes it needs for real training.)
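A rough sketch of both suggestions – INFO logging plus a re-iterable corpus rather than a one-shot iterator – assuming your corpus lives in a hypothetical one-sentence-per-line text file (the path and the naive tokenization are placeholders):

import logging
import multiprocessing
from gensim.models.word2vec import Word2Vec

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

class SentenceCorpus:
    """Restartable iterable: a fresh iterator is created on every pass over the data."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()   # naive whitespace tokenization

sentences = SentenceCorpus("medicine_sentences.txt")   # hypothetical file
model = Word2Vec(sentences, min_count=5, vector_size=200, window=6,
                 epochs=5, workers=multiprocessing.cpu_count())

With the logging on, you should see one vocabulary-scan pass followed by the full number of training epochs, each taking noticeable time.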
Related
I have two corpora that are from the same field, but with a temporal shift, say one decade. I want to train Word2vec models on them, and then investigate the different factors affecting the semantic shift.
I wonder how I should initialize the second model with the first model's embeddings to avoid, as much as possible, the effect of variance in co-occurrence estimates.
At a naive & easy level, you can just load one existing model, and .train() it on new data (a minimal sketch follows after these caveats). But note if doing that:
Any words not already known by the model will be ignored, and the word-frequencies that feed algorithmic steps will only reflect the initial vocabulary survey.
While all words in the current corpus will get as many training-updates as their appearances (& your epochs setting) dictate, and thus be nudged arbitrarily-far from their original-model locations, other words from the seed model will stay exactly where they were. But, it's only the interleaved tug-of-war between words in the same training session that makes them usefully comparable. So doing this sequential training – updating only some words in a new training session – is likely to degrade the meaningfulness of word-to-word comparisons, in hard-to-measure ways.
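Here is a minimal sketch of that naive continue-training route, with hypothetical file names and a placeholder later-era corpus; note there is no new build_vocab() call, so words unseen in the original corpus are silently skipped:

from gensim.models.word2vec import Word2Vec

new_sentences = [["later", "era", "sentence"], ["another", "later", "sentence"]]  # placeholder corpus

model = Word2Vec.load("era1_word2vec.model")   # hypothetical earlier-era model
model.train(new_sentences,
            total_examples=len(new_sentences),
            epochs=model.epochs)
model.save("era1_plus_era2.model")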
Another approach that might be worth trying is to train a single model over the combined corpus – but transform/repeat the era-specific texts/words in certain ways so that earlier usages can be distinguished from later usages. There are more details about this suggestion, in the context of word-vectors varying over usage eras, in a couple of previous answers:
https://stackoverflow.com/a/57400356/130288
https://stackoverflow.com/a/59095246/130288
Suppose my corpus is reasonably large - having tens-of-thousands of unique words. I can either use it to build a word2vec model directly (Approach #1 in the code below) or initialize a new word2vec model with pre-trained model weights and fine-tune it with my own corpus (Approach #2). Is approach #2 worth considering? If so, is there a rule of thumb for when I should consider a pre-trained model?
# Approach #1: train directly on my own corpus
from gensim.models import Word2Vec
model = Word2Vec(my_corpus, vector_size=300, min_count=1)

# Approach #2: build the vocab from my corpus, seed overlapping words with the
# pre-trained GoogleNews vectors, then continue training
model = Word2Vec(vector_size=300, min_count=1)
model.build_vocab(my_corpus)
model.intersect_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)
model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)
The general answer to this type of question is: you should try them both, and see which works better for your purposes.
No one without your exact data & project goals can be sure which will work better in your situation, and you'll need that same kind of ability to evaluate alternate choices in order to do all sorts of very basic, necessary tuning of your work.
Separately:
"fine-tuning" word2vec-vectors can mean many things, and can introduce a number of expert-leve thorny tradeoff-decisions - the sorts of tradeoffs that can only be navigated if you've got a robust way to test different choices against each other.
The specific simple tuning approach your code shows – which relies on an experimental method (intersect_word2vec_format()) that might not work in the latest Gensim – is pretty limited, and since it discards all the words in the outside vectors that aren't already in your own corpus, it also gives up one of the major reasons people often want to mix older vectors in: to cover more words that aren't in their training data. (I doubt that approach will be useful in many cases, but as per above, to be sure you'd want to try it with respect to your data/goals.)
It's almost always a bad idea to use min_count=1 with word2vec & similar algorithms. If such rare words are truly important, find more training examples so good vectors can be trained for them. But without enough training examples, it's usually better to ignore them - keeping them can even make the vectors for surrounding words worse.
I am trying to train a custom NER model using spaCy. Currently, I have more than 2k records for training, each text consists of more than 100 words, and there are at least 2 entities per record. I am running it for 50 iterations.
It is taking more than 2 hours to train completely.
Is there any way to train using multiprocessing? Will it improve the training time?
Short answer... probably not
It's very unlikely that you will be able to get this to work for a few reasons:
The network being trained is performing iterative optimization
Without knowing the results from the batch before, the next batch cannot be optimized
There is only a single network
Any parallel training would be creating divergent networks...
...which you would then somehow have to merge
Long answer... there's plenty you can do!
There are a few different things you can try however:
Get GPU training working if you haven't
It's a pain, but can speed up training time a bit
It will dramatically lower CPU usage however
Try to use spaCy command line tools
The JSON format is a pain to produce but...
The benefit is you get a well optimised algorithm written by the experts
It can have dramatically faster / better results than hand crafted methods
If you have different entities, you can train multiple specialised networks
Each of these may train faster
These networks could be done in parallel to each other (CPU permitting)
Optimise your python and experiment with parameters
Speed and quality are very dependent on parameter tweaking (batch size, repetitions, etc.)
Your python implementation providing the batches (make sure this is top notch)
Pre-process your examples
spaCy NER extraction requires a surprisingly small amount of context to work
You could try pre-processing your snippets to contain 10 or 15 surrounding words and see how your time and accuracy fare (a rough sketch of this trimming follows after this list)
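For that last pre-processing idea, here is a rough sketch of trimming each training example to a window of words around its entity, assuming the common (text, {"entities": [(start, end, label), ...]}) training-tuple format; the sample record and the window size are only illustrations:

def trim_example(text, entities, window=10):
    # keep up to `window` words either side of each entity, recomputing offsets
    trimmed = []
    for start, end, label in entities:
        left = " ".join(text[:start].split()[-window:])
        right = " ".join(text[end:].split()[:window])
        snippet = (left + " " if left else "") + text[start:end] + (" " + right if right else "")
        new_start = len(left) + (1 if left else 0)
        new_end = new_start + (end - start)
        trimmed.append((snippet, {"entities": [(new_start, new_end, label)]}))
    return trimmed

# one hypothetical record: 'aspirin' at characters 27-34
examples = trim_example("The patient was prescribed aspirin after surgery.",
                        [(27, 34, "DRUG")], window=5)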
Final thoughts... when is your network "done"?
I have trained networks with many entities on thousands of examples, for longer than you describe, and the long and short of it is: sometimes it just takes time.
However, 90% of the increase in performance is captured in the first 10% of training.
Do you need to wait for 50 batches?
... or are you looking for a specific level of performance?
If you monitor the quality every X batches, you can bail out when you hit a pre-defined level of quality.
You can also keep old networks you have trained on previous batches and then "top them up" with new training to get to a level of performance you couldn't by starting from scratch in the same time.
Good luck!
Hi, I did a similar project where I created a custom NER model using spaCy 3 and extracted 26 entities from a large dataset. It really depends on how you are passing your data. Follow the steps I mention below; it may well work on CPU:
Annotate your text files and save into JSON
Convert your JSON files into the .spacy format, because this is the format spaCy accepts (a conversion sketch follows after these steps).
The point to note here is how you pass and serialize your data into spaCy Doc objects when building that .spacy file.
Passing all your text at once will make training take longer, so split your data and pass it iteratively. Don't pass one consolidated blob; split it.
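As a rough sketch of the conversion step, assuming your annotations are exported as a JSON list of {"text": ..., "entities": [[start, end, label], ...]} records (the file name and layout are assumptions - adapt the loop to whatever your annotation tool actually produces):

import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

with open("annotations.json", encoding="utf-8") as f:    # hypothetical export file
    records = json.load(f)

for record in records:
    doc = nlp.make_doc(record["text"])
    spans = []
    for start, end, label in record["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:      # skip annotations that don't align to token boundaries
            spans.append(span)
    doc.ents = spans
    doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")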
What are the important steps to preprocess Twitter texts for binary classification? What I did is remove the hashtag symbol and keep the word, and I also used some regular expressions to remove special characters. These are the two functions I used:
import re

def removeusername(tweet):
    return " ".join(word.strip() for word in re.split('#|_', tweet))

def removingSpecialchar(text):
    return ' '.join(re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", text).split())
What other things should I do to preprocess the text data? I have also used the NLTK stopword corpus to remove all stop words from the tokenized words.
I used the NaiveBayes classifier in TextBlob to train the data, and I am getting 94% accuracy on training data and 82% on testing data. I want to know whether there is any other method to get better accuracy. By the way, I am new to this machine learning field and have only a limited idea about all of it!
Well, then you can start by playing with the size of your vocabulary. You might exclude some of the words that are too frequent in your data (without being considered stop words), and also do the same with words that appear in only one tweet (misspelled words, for example). Sklearn's CountVectorizer allows you to do this in an easy way; have a look at the min_df and max_df parameters.
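A tiny sketch of that pruning, with placeholder tweets and arbitrary thresholds rather than recommended values:

from sklearn.feature_extraction.text import CountVectorizer

tweets = ["sample tweet one", "another sample tweet", "something different"]  # placeholder data
# max_df drops terms appearing in more than 90% of tweets,
# min_df drops terms appearing in fewer than 2 tweets
vectorizer = CountVectorizer(max_df=0.9, min_df=2)
X = vectorizer.fit_transform(tweets)
print(sorted(vectorizer.vocabulary_))  # only 'sample' and 'tweet' survive here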
Since you are working with tweets, you can also think about URL strings. Try to obtain some valuable information from links; there are lots of different options, from simple stuff based on regular expressions that retrieve the domain name of the page to more complex NLP-based methods that study the link content. Once more, it's up to you!
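At the simple end of that spectrum, a small sketch of pulling the domain out of any URLs in a tweet (the example tweet is made up):

import re
from urllib.parse import urlparse

def extract_domains(tweet):
    # find anything that looks like an http(s) URL, then keep only the host part
    urls = re.findall(r"https?://\S+", tweet)
    return [urlparse(url).netloc for url in urls]

print(extract_domains("new study out https://www.nejm.org/doi/full/123 #medicine"))
# ['www.nejm.org']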
I would also have a look at how pronouns are handled if you lemmatize (spaCy's lemmatizer, for example, by default replaces all of them with the keyword -PRON-). This is a classic solution that simplifies things but might end in a loss of information.
For preprocessing raw data, you can try:
Stop word removal.
Stemming or Lemmatization.
Exclude terms that are either too common or too rare.
Then a second step preprocessing is possible:
Construct a TFIDF matrix.
Construct or load pretrained word embeddings (Word2Vec, FastText, ...).
Then you can load result of the second steps into your model.
These are just the most common methods; many others exist.
I will let you check each one of these methods by yourself, but it is a good base (a minimal sketch of such a pipeline follows below).
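A minimal sketch tying the first-step ideas together with NLTK and scikit-learn; the two documents are placeholders, and on a real corpus you would also set min_df/max_df to exclude very rare or very common terms as discussed:

from nltk.corpus import stopwords        # requires nltk.download('stopwords') once
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # lowercase, drop stop words, stem the rest
    return " ".join(stemmer.stem(t) for t in text.lower().split() if t not in stop_words)

docs = ["Patients were treated with aspirin", "Aspirin treatment of the patients"]  # placeholder corpus
tfidf = TfidfVectorizer(preprocessor=preprocess)  # on a real corpus, also set min_df / max_df
X = tfidf.fit_transform(docs)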
There are no compulsory steps. For example, it is very common to remove stop words (also called functional words) such as "yes", "no", "with". But in one of my pipelines, I skipped this step and the accuracy did not change. NLP is an experimental field, so the most important advice is to build a pipeline that runs as quickly as possible, to define your goal, and to train with different parameters.
Before you move on, you need to make sure your training set is proper. What are you training for? Is your set clean (e.g. the positive class contains only positives)? How do you define accuracy, and why?
Now, the situation you described seems like a case of over-fitting. Why? because you get 94% accuracy on the training set, but only 82% on the test set.
This problem happens when you have a lot of features but relatively small training dataset - so the model is fitted best for the specific train set but fails to generalize.
Now, you did not specify how large your dataset is, so I'm guessing it's between 50 and 500 tweets, which is too small given an English vocabulary of some 200k words or more. I would try one of the following options:
(1) Get more training data (at least 2000)
(2) Reduce the number of features; for example, you can remove uncommon words and names - any words that appear only a small number of times
(3) Use a better classifier (Naive Bayes is rather weak for NLP); try an SVM or deep learning
(4) Try regularization techniques (a sketch combining (3) and (4) follows below)
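A hedged sketch of options (3) and (4): a linear SVM over TF-IDF features, with the regularization strength C chosen by cross-validation. The four example tweets, their labels, and the C grid are placeholders:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

texts = ["great product", "awful service", "really great", "so awful"]   # placeholder tweets
labels = [1, 0, 1, 0]                                                    # placeholder binary labels

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),          # C controls the regularization strength
])
search = GridSearchCV(pipeline, {"svm__C": [0.01, 0.1, 1, 10]}, cv=2)
search.fit(texts, labels)
print(search.best_params_)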
I am a bit confused regarding an aspect of Doc2Vec. Basically, I am not sure if what I do makes sense. I have the following dataset:
train_doc_0 --> label_0
... ...
train_doc_99 --> label_0
train_doc_100 --> label_1
... ...
train_doc_199 --> label_1
... ...
... ...
train_doc_239999 --> label_2399
eval_doc_0
...
eval_doc_29
Where train_doc_n is a short document, belonging to some label. There are 2400 labels, with 100 training documents per label. The eval_doc_n are evaluation documents whose labels I would like to predict in the end (using a classifier).
I train a Doc2Vec model with these training documents & labels. Once the model is trained, I reproject each of the original training documents, as well as my evaluation documents (the ones I would like to classify in the end), into the model's space using infer_vector.
The result is the following set of matrices:
X_train (240000, 300) # doc2vec vectors for training documents
y_train (240000,)     # corresponding labels
X_eval  (30, 300)     # doc2vec vectors for evaluation documents
My problem is the following: if I run a simple cross-validation on X_train and y_train, I get decent accuracy. But once I try to classify my evaluation documents (even using only 50 randomly sampled labels), I get super bad accuracy, which makes me question my way of approaching this problem.
I followed this tutorial for the training of documents.
Does my approach make sense, especially with reprojecting all the training documents using infer_vector ?
I don't see anything blatantly wrong.
Are the evaluation documents similar to the training documents in length, vocabulary, etc? Ideally, they'd be a randomly-chosen subset of all available labeled examples. (If quite different, that might be a reason why cross-validation versus held-out-evaluation accuracy varies.)
When training the Doc2Vec model, are you giving each document a single unique ID as the only entry of its tags? Or are you using the label_n labels as the tags of your training examples? Or perhaps both? (Any of those are defensible choices, though I've found mixing known-labels into the usually 'unsupervised' Doc2Vec training, making it semi-supervised, often helps the models' vectors become more useful as input to later explicitly-supervised classifiers.)
When I get unprecedented 'super-bad' accuracy in an unexpected step, often it's because some erroneous shuffling or re-ordering of the test examples has occurred – randomizing the real relationships. So it's worth double-checking for that, in code and by looking at a few examples in detail.
Re-inferring vectors for examples used in training, rather than simply asking for the trained-up vectors retained in the model, sometimes results in better vectors. However, many have observed that different-than-default parameters to infer_vector(), especially many-more steps and perhaps a starting alpha closer to that used during training, may improve results. (Also, inference seems to work better in fewer steps in the simpler PV-DBOW, dm=0, mode. PV-DM, dm=1, may especially require more steps.)
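A small sketch of such re-inference with non-default parameters; the model path and the epochs/alpha values are illustrative, not tuned (in current gensim the relevant parameter is named epochs, where older versions called it steps):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("my_doc2vec.model")                # hypothetical saved model
tokens = "short evaluation document goes here".split()  # placeholder document

default_vec = model.infer_vector(tokens)
patient_vec = model.infer_vector(tokens, epochs=100, alpha=0.025)  # more passes, training-like alpha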
The tutorial you link shows a practice, calling train() multiple times while adjusting alpha yourself, that's generally unnecessary and error-prone – and specifically isn't likely to be doing the right thing in the latest gensim versions. You can leave the default alpha/min_alpha in place, and supply a preferred iter value during Doc2Vec initialization - and then one call to train() will automatically do that many passes, and glide the learning-rate down properly. And since the default iter is 5, if you don't set it, every call to train() is doing 5 passes - so doing your own external loop of 10 would mean 50 passes, and the code at that tutorial, with two calls to train() per loop for some odd reason, would mean 100 passes.
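And a sketch of the recommended pattern: set the epoch count once at construction and make a single train() call, letting gensim glide the learning rate down itself. The toy corpus and parameter values are placeholders, and note that current gensim names the parameter epochs rather than iter:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# placeholder corpus: in practice, one TaggedDocument per training text, with your chosen tags
corpus = [TaggedDocument("the patient was given aspirin".split(), [str(i)])
          for i in range(10)]

model = Doc2Vec(vector_size=300, min_count=2, epochs=20, dm=0)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)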