I am pretty new to doc2vec then I made small research and found a couple of things. Here is my story: I am trying to learn using doc2vec 2.4 million documents. At first, I tried only doing so with a small model of 12 documents. I checked the results with infer vector of the first document and found it to be similar indeed to the first document by 0.97-0.99 cosine similarity measure. Which I found good, even though when I tried to enter a new document of completely different words I received a high score of 0.8 measure similarity. However, I had put it aside and tried to go on and build the full model with the 2.4 million documents. In this point, my problems began. The result was complete nonsense, I received in the most_similar function results with a similarity of 0.4-0.5 which were completely different from the new document checked. I tried to tune parameters but no result yet. I tried also to remove randomness both from the small and big model, however, I still got different vectors. Then I had tried to use get_latest_training_loss on each epoch in order to see how the loss changes over each epoch. This is my code:
model = Doc2Vec(vector_size=300, alpha=0.025, min_alpha=0.025, pretrained_emb=".../glove.840B.300D/glove.840B.300d.txt", seed=1, workers=1, compute_loss=True)
workers=1, compute_loss=True)
model.build_vocab(documents)
for epoch in range(10):
for i in range(model_glove.epochs):
model.train(documents, total_examples = token_count, epochs=1)
training_loss = model.get_latest_training_loss()
print("Training Loss: " + str(training_loss))
model.alpha -= 0.002 # decrease the learning rate
model.min_alpha = model.alpha # fix the learning rate, no decay
I know this code is a bit awkward, but it is used here only to follow the loss.
The error I receive is:
AttributeError: 'Doc2Vec' object has no attribute 'get_latest_training_loss'
I tried looking at model. and auto-complete and found that indeed there is no such function, I found something similar name training_loss, but it gives me the same error.
Anyone here can give me an idea?
Thanks in Advance
Especially as a beginner, there's no pressing need to monitor training-loss. For a long time, gensim didn't report it in any way for any models – and it was still possible to evaluate & tune models.
Even now, running-loss-reporting in gensim kind of a rough, incomplete, advanced/experimental feature – and after a recent refactoring it doesn't seem to have full support in Doc2Vec. (Notably, while having the loss level reach a plateau can be a helpful indicator that further training can't help, it is most definitely not the case that a model with arbitrarily-lower-loss is better than others. In particular, a model that achieves near-zero loss would likely be extremely overfit, and probably of little use for downstream applications.)
Regarding your general aim, of getting good vectors, with regard to the process you've described/shown:
Tiny tests (as with your 12 documents) don't really work with these algorithms, except to check that you're calling the steps with legal parameters. You shouldn't expect the similarities in such toy-sized tests to mean anything, even if they superficially meet expectations in some cases. The algorithms need lots of training data & large vocabularies to train sensible models. (So, your full 2.4 million docs should work well.)
You generally shouldn't be changing the default alpha/min_alpha values, or call train() multiple times in a loop. You can just leave those at their defaults, and call train() with your desired number of training epochs – and it will do the right thing. The approach in your shown code is a suboptimal and fragile anti-pattern – whichever online source you learned it from is misguided and severely outdated.
You haven't shown your inference code, but note that it will re-use the epochs, alpha, and min_alpha cached in the model instance from original initialization, unless you supply other values. And, the default epochs if not-specified is a value inherited from shared code with Word2Vec of just 5. Doing a mere 5 epochs, and leaving the effective alpha at 0.025 the whole time (as alpha=0.025, min_alpha=0.025 does to inference), is unlikely to give good results, especially on short docs. Common epochs values from published work are 10-20 - and doing at least as many for inference as were used for training is typical.
You are showing the use of a pretrained_emb initialization parameter that is not part of the standard gensim library, so perhaps you're using some other fork, based on some older version of gensim.. Note that it's not typical to initialize a Doc2Vec model with word-embeddings from elsewhere before training, so if doing that, you're already in advanced/experimental territory – which is premature if you're still trying to get some basic doc-vectors into reasonable shape. (And, usually people seek tricks like re-used word-vectors if they have a small corpus. With 2.4 million docs, you probably don't have such corpus problems – any word-vectors can be learned from your corpus along with doc-vectors, in the default way.)
Related
I have 2 questions that I would like to ascertain if possible (questions are bolded):
I've recently understood (I hope) the random forest classification algorithm, and have tried to apply it using sklearn on Python on a rather large dataset of pixels derived from satellite images (with the features being the different bands, and the labels being specific features that I outlined by myself, i.e., vegetation, cloud, etc). I then wanted to understand if the model was experiencing a variance problem, and so the first thought that came to my mind was to compare between the training and testing data.
Now this is where the confusion kicks in for me - I understand that there have been many different posts about:
How CV error should/should not be used compared to the out of bag (OOB) error
How by design, the training error of a random forest classifier is almost always ~0 (i.e., fitting my model on the training data and using it to predict on the same set of training data) - seems to be the case regardless of the tree depth
Regarding point 2, it seems that I can never compare my training and test error as the former will always be low, and so I decided to use the OOB error as my 'representative' training error for the entire model. I then realized that the OOB error might be a pseudo test error as it essentially tests trees on points that they did not specifically learn (in the case of bootstrapped trees), and so I defaulted to CV error being my new 'representative' training error for the entire model.
Looking back at the usage of CV error, I initially used it for hyperparameter tuning (e.g., max tree depth, number of trees, criterion type, etc), and so I was again doubting myself if I should use it as my official training error to be compared against my test error.
What makes this worse is its hard for me to validate what I think is true based on posts across the web because each answers only a small part and might contradict each other, and so would anyone kindly help me with my predicament on what to use as my official training error that will be compared to my test error?
My second question revolves around how the OOB error might be a pseudo test error based on datapoints not selected during bootstrapping. If that were true, would it be fair to say this does not hold if bootstrapping is disabled (the algorithm is technically still a random forest as features are still randomly subsampled for each tree, its just that the correlation between trees are probably higher)?
Thank you!!!!
Generally, you want to distinctly break a dataset into training, validation, and test. Training is data fed into the model, validation is to monitor progress of the model as it learns, and test data is to see how well your model is generalizing to unseen data. As you've discovered, depending on the application and the algorithm, you can mix-up training and validation data or even forgo validation data entirely. For random forest, if you want to forgo having a distinct validation set and just use OOB to monitor progress that is fine. If you have enough data, I think it still makes sense to have a distinct validation set. No matter what, you should still reserve some data for testing. Depending on your data, you may even need to be careful about how you split up the data (e.g. if there's unevenness in the labels).
As to your second point about comparing training and test sets, I think you may be confused. The test set is really all you care about. You can compare the two to see if you're overfitting, so that you can change hyperparameters to generalize more, but otherwise the whole point is that the test set is to the sole truthful evaluation. If you have a really small dataset, you may need to bootstrap a number of models with a CV scheme like stratified CV to generate a more accurate test evaluation.
Suppose my corpus is reasonably large - having tens-of-thousands of unique words. I can either use it to build a word2vec model directly(Approach #1 in the code below) or initialize a new word2vec model with pre-trained model weights and fine tune it with my own corpus(Approach #2). Is the approach #2 worth consideration? If so, is there a rule of thumb on when I should consider a pre-trained model?
# Approach #1
from gensim.models import Word2Vec
model = Word2Vec(my_corpus, vector_size=300, min_count=1)
# Approach #2
model = Word2Vec(vector_size=300, min_count=1)
model.build_vocab(my_corpus)
model.intersect_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)
model.train(my_corpus, total_examples=len(my_corpus))
The general answer to this type of question is: you should try them both, and see which works better for your purposes.
No one without your exact data & project goals can be sure which will work better in your situation, and you'll need to exact same kind of ability-to-evaluate alterante choices to do all sorts of very basic, necessary tuning of your work.
Separately:
"fine-tuning" word2vec-vectors can mean many things, and can introduce a number of expert-leve thorny tradeoff-decisions - the sorts of tradeoffs that can only be navigated if you've got a robust way to test different choices against each other.
The specific simple tuning approach your code shows - which relies on an experimental method (intersect_word2vec_format()) that might not work in the latest Gensim – is pretty limited, and since it discards all the words in the outside vectors that aren't already in your own corpus, also discards one of the major reasons people often want to mix older vectors in - to cover more words not in their training data. (I doubt that approach will be useful in many cases, but as per above, to be sure you'd want to try it with respect to your data/goals.
It's almost always a bad idea to use min_count=1 with word2vec & similar algorithms. If such rare words are truly important, find more training examples so good vectors can be trained for them. But without enough training examples, they're usually better to ignore - keeping them even makes the vectors for surrounding words worse.
I recently tested many hyperparameter combinations using sklearn.model_selection.GridSearchCV. I want to know if there is a way to call all previous estimators that were trained in the process.
search = GridSearchCV(estimator=my_estimator, param_grid=parameters)
# `my_estimator` is a gradient boosting classifier object
# `parameters` is a dictionary containing all the hyperparameters I want to try
I know I can call the best estimator with search.best_estimator_, but I would like to call all other estimators as well so I can test their individual performance.
The search took around 35 hours to complete, so I really hope I do not have to do this all over again.
NOTE: This was asked a few years ago (here), but sklearn has been updated multiple times since and the anwer may be different now (I hope).
No, none of the tested models are saved, except (optionally, but by default) one final one trained on the entire training set, your best_estimator_. Especially when models store significant amounts of data (e.g. KNNs), saving all the fitted estimators would be very memory-expensive, and usually not of much use. (cross_validate does have a parameter return_estimator, but the hyperparameter tuners do not. If you have a compelling reason to add it, it probably wouldn't take much work and you could open a GitHub Issue at sklearn.)
However, you do have the cv_results_ attribute that documents the scores of all of the tested estimators. That's usually enough for inspection purposes.
I had implemented a CNN with 3 Convolutional layers with Maxpooling and dropout after each layer
I had noticed that when I trained the model for the first time it gave me 88% as testing accuracy but after retraining it for the second time successively, with the same training dataset it gave me 92% as testing accuracy.
I could not understand this behavior, is it possible that the model had overfitting in the second training process?
Thank you in advance for any help!
It is quite possible if you have not provided the seed number set.seed( ) in the R language or tf.random.set_seed(any_no.) in python
Well I am no expert when it comes to machine learning but I do know the math behind it. What you are doing when you train a neural network you basicly find the local minima to the loss function. What this means is that the end result will heavily depend on the initial guess of all of the internal varaibles.
Usually the variables are randomized as a initial estimation and you could therefore reach quite different results from running the training process multiple times.
That being said, from when I studied the subject I was told that you usually reach similar regardless of the initial guess of the parameters. However it is hard to say if 0.88 and 0.92 would be considered similar or not.
Hope this gives a somewhat possible answer to your question.
As mentioned in another answer, you could remove the randomization, both in the parameter initialization of the parameters and the randomization of the data used for each epoch of training by introducing a seed. This would insure that when you run it twice, everything will get "randomized" in the exact same order. In tensorflow this is done using for example tf.random.set_seed(1), the number 1 can be changed to any number to get a new seed.
I am a bit confused regarding an aspect of Doc2Vec. Basically, I am not sure if what I do makes sense. I have the following dataset :
train_doc_0 --> label_0
... ...
train_doc_99 --> label_0
train_doc_100 --> label_1
... ...
train_doc_199 --> label_1
... ...
... ...
train_doc_239999 --> label_2399
eval_doc_0
...
eval_doc_29
Where train_doc_n is a short document, belonging to some label. There are 2400 labels, with 100 training documents per label. eval_doc_0 are evaluation documents where I would like to predict their label in the end (using a classifier).
I train a Doc2Vec model with these training documents & labels. Once the model is trained, I reproject each of the original training document as well as my evaluation documents (the ones I would like to classify in the end) into the model's space using infer_vector.
The resulting is a matrix :
X_train (240000,300) # doc2vec vectors for training documents
y_train (240000,) # corresponding labels
y_eval (30, 300) # doc2vec vectors for evaluation documents
My problem is the following : If I run a simple cross validation on X_train and y_train, I have a decent accuracy. Once I try to classify my evaluation documents (even, using only 50 randomly sampled labels) I have a super bad accuracy, which makes me question my way of approaching this problem.
I followed this tutorial for the training of documents.
Does my approach make sense, especially with reprojecting all the training documents using infer_vector ?
I don't see anything blatantly wrong.
Are the evaluation documents similar to the training documents in length, vocabulary, etc? Ideally, they'd be a randomly-chosen subset of all available labeled examples. (If quite different, that might be a reason why cross-validation versus held-out-evaluation accuracy varies.)
When training the Doc2Vec model, are you giving each document a single unique ID as the only entry of its tags? Or are you using the label_n labels as the tags of your training examples? Or perhaps both? (Any of those are defensible choices, though I've found mixing known-labels into the usually 'unsupervised' Doc2Vec training, making it semi-supervised, often helps the mdoels' vectors become more useful as input to later explicitly-supervised classifiers.)
When I get unprecedented 'super-bad' accuracy in an unexpected step, often it's because some erroneous shuffling or re-ordering of the test examples has occurred – randomizing the real relationships. So it's worth doubling-checking for that, in code and by looking at a few examples in detail.
Re-inferring vectors for examples used in training, rather than simply asking for the trained-up vectors retained in the model, sometimes results in better vectors. However, many have observed that different-than-default parameters to infer_vector(), especially many-more steps and perhaps a starting alpha closer to that used during training, may improve results. (Also, inference seems to work better in fewer steps in the simpler PV-DBOW, dm=0, mode. PV-DM, dm=1, may especially require more steps.)
The tutorial you link shows a practice, calling train() multiple times while adjusting alpha yourself, that's generally unnecessary and error-prone – and specifically isn't likely to be doing the right thing in the latest gensim versions. You can leave the default alpha/min_alpha in place, and supply a preferred iter value during Doc2Vec initialization - and then one call to train() will automatically do that many passes, and glide the learning-rate down properly. And since the default iter is 5, if you don't set it, every call to train() is doing 5 passes - so doing your own external loop of 10 would mean 50 passes, and the code at that tutorial, with two calls to train() per loop for some odd reason, would mean 100 passes.