I have a dataset of 1.2 million single-sentence descriptions (5-50 words) and I want to cluster these into n clusters. For vector conversion, I want to use doc2vec to get 1.2 million equal-size vectors. However, I'm not sure what the size parameter should be. I've read it should be between 100-300, but since each document in this case has fewer tokens (words), should the vector be smaller?
Your data – over a million texts, and perhaps tens-of-millions of words – is certainly large enough to try a default vector-size of 100 dimensions.
People with smaller datasets may need to try even smaller vector-sizes, but that's getting far from the cases where Doc2Vec ('Paragraph Vectors') works well.
But the actual best size for your dataset & goals is something you have to find out via experimentation. (If your dataset is dominated by 5-word texts, and if your vocabulary of unique words is quite small, maybe you'll need to try lower sizes, too.)
There's no one answer – the variety of your texts/vocabulary, & the patterns in your data, will affect the best choice. Only your own project-specific repeatable evaluation, which you can use to compare alternate choices, can guide you to what's best.
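For instance, a minimal sketch of such an experiment with Gensim's Doc2Vec, where texts is a placeholder for your 1.2M tokenized descriptions and score_model() stands in for whatever project-specific evaluation you devise:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# texts: your tokenized descriptions, e.g. [['red', 'cotton', 'shirt'], ...] (placeholder name)
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]

for size in (50, 100, 200, 300):
    model = Doc2Vec(corpus, vector_size=size, min_count=5, epochs=20, workers=4)
    print(size, score_model(model))   # score_model(): your own repeatable, project-specific evaluation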
I have a dataset of over 1 million rows. Each row has 40 token words. Based on these tokens, a classification is made with a neural network. The vocabulary is 20,000 unique words. It is a binary classification problem. I set the size (dimension) of the vectors in gensim Word2Vec as 150 and saved these vectors for each data point in a json file. The json file's size is really huge: 250 GB. I cannot load this file into memory all at once as my RAM is only 128 GB. I am trying to see if I can reduce the physical size of these vectors by reducing them to the right size. I went through some of the suggestions made on this website, such as Relation between Word2Vec vector size and total number of words scanned?. But the vector size is mentioned to be 100-300 and also depends on the problem.
Here is what I am doing:
import gensim

# for training the word2vec model on the tokenized rows
w2vmodel = gensim.models.Word2Vec(one_mil_tokens, vector_size=150, window=2, min_count=1, sg=0, seed=1)
w2vmodel.save("w2model.trained")
and
model = gensim.models.Word2Vec.load("w2model.trained")
finalvecs = []
# tokens is a list of over 1 million rows, each a list of word tokens
for token in tokens:
    vec = []
    for word in token:
        vec.append(model.wv[word].tolist())  # the 150-d vector for each word
    finalvecs.append(vec)
I am doing json.dump() for finalvecs.
How can I determine the right size (dimension) of the vector for each token based on the given problem?
I use the skip-gram model to train Word2Vec. Should I use CBOW to optimize the size?
Is json the right format to store/retrieve these vectors, or are there other more efficient ways?
Each dimension of a dense vector is typically a 32-bit float.
So, storing 20,000 token-vectors of 150 dimensions each will take at least 20000 vectors * 150 floats * 4 bytes/float = 12MB for the raw vector weights, plus some overhead for remembering which token associates with which line.
Let's say you were somehow changing each of your rows into a single summary vector of the same size. (Perhaps, by averaging together each of the ~40 token vectors into a single vector – a simple baseline approach, though there are many limitations of that approach, & other techniques that might be used.) In that case, storing the 1 million vectors will necessarily take about 1000000 vectors * 150 floats * 4 bytes/float = 600MB for the raw vector weights, plus some overhead to remember which row associates with which vector.
That neither of these is anywhere close to 250GB implies you're making some other choices expanding things significantly. JSON is a poor choice for compactly representing dense floating-point numerical data, but even that is unlikely to explain the full expansion.
Your description that you "saved these vectors for each data point in a json file" isn't really clear what vectors are being saved, or in what sort of JSON conventions.
Perhaps you're storing the 40 separate vectors for each row? That'd give a raw baseline weights-only size of 1000000 rows * 40 tokens/row * 150 floats * 4 bytes/float = 24GB. It is plausible that inefficient JSON is expanding the stored size by something like 10x, so I'd guess you're doing something like this.
But beyond the inefficiency of JSON, the 40 tokens per row (drawn from a vocabulary of 20k) already give enough info to reconstitute anything that's solely a function of the tokens & the 20k word-vectors, so there's not really any reason to expand the representations this way.
For example: If the word 'apple' is already in your dataset, and appears many thousands of times, there's no reason to re-write the 150 dimensions of 'apple' many thousands of times. The word 'apple' alone is enough to call-back those 150 dimensions, whenever you want them, from the much-smaller (12MB) set-of-20k token-vectors, that's easy to keep in RAM.
So mainly: ditch JSON, don't (unnecessarily) expand each row into 40 * 150 dimensions.
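As a rough sketch of the averaging baseline mentioned above, stored compactly with NumPy instead of JSON (assuming the w2vmodel and tokens variables from your code; the simple averaging is just one possible way to summarize a row):

import numpy as np

# one 150-d summary vector per row, by averaging that row's word-vectors
row_vecs = np.zeros((len(tokens), 150), dtype=np.float32)
for i, row in enumerate(tokens):
    word_vecs = [w2vmodel.wv[w] for w in row if w in w2vmodel.wv]
    if word_vecs:
        row_vecs[i] = np.mean(word_vecs, axis=0)

np.save("row_vecs.npy", row_vecs)   # ~600MB of raw float32 weights, versus 250GB of JSON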
To your specific questions:
The optimal size will vary based on lots of things, including your data & other goals. The only way to rationally choose is to figure out some way to score the trained vectors on your true end goals: some repeatable way of comparing multiple alternate choices of vector_size. Then you run it every plausible way & pick the best. (Barring that, you take a random stab based on some precedent work that seems roughly similar in data/goals, and hope that works OK until you have the chance to compare it against other choices.)
The choice of skip-gram or CBOW won't affect the size of the model at all. It might affect end-result quality & training times a bit, but the only way to choose is to try both & see which works better for your goals & constraints.
JSON is an awful choice for storing dense binary data. Representing numbers as plain text involves expansion. The JSON formatting characters add extra overhead, repeated on every row, that's redundant if every row is the exact same 'shape' of raw data. And, typical later vector operations in RAM usually work best on the same sort of maximally-compact raw in-memory representation that would also be best on disk. (In fact, the best on-disk representation will often be data that exactly matches the format in memory, so that data can be "memory-mapped" from disk to RAM in a quick direct operation that minimizes format-wrangling & even defers accesses until reads are needed.)
Gensim will efficiently save its models via their built-in .save() method, into one (or more often several) related files on disk. If your Gensim Word2Vec model is in the variable w2v_model, you can just save the whole model with w2v_model.save(YOUR_FILENAME) – & later reload it with Word2Vec.load(YOUR_FILENAME).
But if after training you only need the (20k) word-vectors, you can just save the w2v_model.wv property – just the vectors: w2v_model.wv.save(YOUR_FILENAME). Then you can reload them as an instance of KeyedVectors: KeyedVectors.load(YOUR_FILENAME).
(Note in all cases: the save may be spread over multiple files, which if ever copied/moved elsewhere, should be kept together - even though you only ever specify the 'root' file of each set in save/load operations.)
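For example, a small sketch using the variable names above (some_token is just a placeholder for any word in your 20k vocabulary):

from gensim.models import Word2Vec, KeyedVectors

w2v_model.save("w2model.trained")                        # whole model; can be reloaded & trained further
reloaded_model = Word2Vec.load("w2model.trained")

w2v_model.wv.save("w2model.wordvectors")                 # just the ~20k word-vectors (~12MB of raw weights)
word_vectors = KeyedVectors.load("w2model.wordvectors")
vec = word_vectors[some_token]                           # some_token: placeholder for any word in the vocabulary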
How & whether you'd want to store any vectorization of your 1 million rows would depend on other things not yet specified, including the character of the data, and the kinds of classification applied later. I doubt that you want to turn your rows into 40 * 150 = 6000 dimensions – that'd be counter to some of the usual intended benefits of a Word2Vec-based analysis, where the word 'apple' has much the same significance no matter where it appears in a textual list-of-words.
You'd have to say more about your data, & classification goals, to get a better recommendation here.
If you haven't already done a 'bag-of-words' style representation of your rows (no word2vec), where every row is represented by a 20000-dimension one-hot sparse vector, and run a classifier on that, I'd recommend that first, as a baseline.
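A minimal sketch of that baseline with scikit-learn, where texts and labels are placeholders for your 1 million row strings and their binary labels:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

vectorizer = CountVectorizer(max_features=20000, binary=True)   # sparse 20000-dimension bag-of-words
X = vectorizer.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.1, random_state=1)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # a baseline accuracy for the word2vec-based approach to beat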
Are your word embeddings for each unique token static or context-dependent?
If they are static (always the same for the exact same word), it would make sense to get all unique tokens once and assign each token an integer ID (mapping id2string and string2id). There also are Gensim functions to do that. Then get the vector of length 150 for each unique string token and put it in something like a dictionary (so mapping id2vector or string2vector).
Then you just have to save the ids for each row (assuming it is fine for you to discard all out-of-vocabulary tokens) as well as the mapping dict. So instead of saving ~40m strings (the texts) and ~6b (40m * 150 vector size) floats you just have to save ~20k strings + 3m floats (20k * 150) and ~40m ints which should take drastically less space.
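A rough sketch of that idea, assuming a trained Gensim 4.x model in model and the tokenized rows in tokens (out-of-vocabulary tokens discarded); the file names and layout are just one possible choice:

import json
import numpy as np

string2id = model.wv.key_to_index                 # ~20k token -> integer id (Gensim 4.x)
vectors = model.wv.vectors.astype(np.float32)     # (20000, 150) array, ~12MB of raw weights

# each row becomes a short list of ints instead of 40 * 150 floats
rows_as_ids = [[string2id[w] for w in row if w in string2id] for row in tokens]

np.save("vocab_vectors.npy", vectors)
with open("string2id.json", "w") as f:
    json.dump(string2id, f)
with open("rows_as_ids.json", "w") as f:          # or an even more compact binary format
    json.dump(rows_as_ids, f)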
As for the dimension of the vector: I would expect the performance to be better with larger vectors (e.g. 300 is probably better than 100, if you have enough training data to train such detailed embeddings), so if you are looking for performance instead of optimizing efficiency I'd try to keep the vector size not too small.
I am trying to train a text classifier with FastText. It has a bunch of options along with a facility to train from the command line. One of the options is wordNgrams.
In my particular dataset, I discovered that many irrelevant queries are being classified with high confidence because they share similar tokens. So my plan was to ignore the unigram tokens and start from bigrams. Currently I go from 1-grams up to 5-grams by setting wordNgrams = 5, but my plan is to go from 2-grams to 5-grams. It seems that FastText doesn't support this. Is there any way to achieve it? This is needed to minimize these false positives.
As far as I can tell, even though Facebook's fasttext lets users set a range for character-n-grams (subword information), using -minn & -maxn, it only offers a single -wordNgrams parameter setting the maximum length of word-multigrams.
However, it is also the case that the -supervised mode combines all the given tokens in an order-oblivious way. Thus, you could in your own preprocessing create whatever mix of n-grams (or other token-represented features) you'd like, then pass those to fasttext (which it would consider as all unigrams). As long as you apply the same preprocessing in training as in later classification, the effect should be the same.
(You could even use the sklearn CountVectorizer's preprocessing.)
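For example, a minimal sketch of such preprocessing, writing a file in fastText's standard __label__ supervised format (training_examples is a placeholder for your own (label, tokens) pairs):

def ngram_features(tokens, n_min=2, n_max=5):
    # join each word n-gram with '_' so fastText treats it as a single unigram token
    feats = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append("_".join(tokens[i:i + n]))
    return feats

with open("train.txt", "w") as out:
    for label, tokens in training_examples:
        out.write("__label__{} {}\n".format(label, " ".join(ngram_features(tokens))))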
Keep in mind the warning from ~Erwan in a comment, though: adding so many distinct features increases the risk of overfitting, which could show up as your stated problem "many irrelevant queries are being classified with high confidence because they share similar tokens". (The model, made large by the inclusion of so many n-grams, has memorized idiosyncratic minutia from the training-data. That then leads it astray applying non-generalizable inferences to out-of-training data.)
I have tried using word2vec but I want to change the similarity between two words. Preferably, not one by one manually. Another option I was considering was creating a corpus that would enforce the correct similarity but I don't know how to do this. Thank you for any advice.
Why? Word2Vec uses large amounts of real-world usage data to create word-vectors that are useful for certain things because they accurately reflect the relationships in the training text.
Changing any vector's position is, in one sense, trivial: just modify the array to whatever values you want. Make all their dimensions zeros! All 100.0! Whatever!
For example, if you want the words 'apple' and 'orange' to have identical vectors, and thus ~1.0 similarity, it's easy to change one to the other. Assuming you've used the popular Python Gensim library to train a Word2Vec model into my_w2v_model:
my_wv = my_w2v_model.wv                       # the model's word-vectors (a KeyedVectors instance)
print(my_wv.similarity('apple', 'orange'))    # similarity before the change
my_wv['apple'] = my_wv['orange']              # overwrite 'apple' with the vector for 'orange'
print(my_wv.similarity('apple', 'orange'))    # now ~1.0
But, now the model has lost any idea of the apple/orange distinction, and the 'apple' vector will now have no neighbors or value other than as an exact synonym of 'orange'.
So, since such changes can destroy the reason for using word-vectors in the 1st place, it's important to know what kind of change, & for what hoped-for benefits, you're seeking.
Maybe you want to clobber end-values directly, or nudge words a little bit, or something else. In particular, if you mostly want words to retain their relationships with other words, you'll want to make more subtle changes.
In some cases, it might make the most sense to change or extend the training data, to shift the model training towards the similarities you want. As one rough quickie example, you could consider preprocessing the data to take every text where 'apple' appears, and with 50% probability, replace 'apple' with 'orange' (and vice-versa for 'orange'). That'd tend to confuse the two in the training texts, and thus result in highly-similar end-vectors, each of which still (by the influence of the unchanged texts) remains much like the original word & its word-neighbors.
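A rough, purely illustrative sketch of that preprocessing, where corpus is a placeholder for your tokenized training texts:

import random

def confuse_pair(tokens, w1='apple', w2='orange', p=0.5):
    # with probability p, swap each occurrence of one word for the other
    out = []
    for tok in tokens:
        if tok == w1 and random.random() < p:
            out.append(w2)
        elif tok == w2 and random.random() < p:
            out.append(w1)
        else:
            out.append(tok)
    return out

augmented_corpus = [confuse_pair(sentence) for sentence in corpus]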
(With more details of your aims, more specific suggestions might be possible.)
In gensim I have a trained doc2vec model. If I have a document and either a single word or two-three words, what would be the best way to calculate the similarity of the words to the document?
Do I just do the standard cosine similarity between them as if they were 2 documents? Or is there a better approach for comparing small strings to documents?
On first thought I could get the cosine similarity between each word in the 1-3 word string and every word in the document, taking the averages, but I don't know how effective this would be.
There's a number of possible approaches, and what's best will likely depend on the kind/quality of your training data and ultimate goals.
With any Doc2Vec model, you can infer a vector for a new text that contains known words – even a single-word text – via the infer_vector() method. However, like Doc2Vec in general, this tends to work better with documents of at least dozens, and preferably hundreds, of words. (Tiny 1-3 word documents seem especially likely to get somewhat peculiar/extreme inferred-vectors, especially if the model/training-data was underpowered to begin with.)
Beware that unknown words are ignored by infer_vector(), so if you feed it a 3-word document for which two words are unknown, it's really just inferring based on the one known word. And if you feed it only unknown words, it will return a random, mild initialization vector that's undergone no inference tuning. (All inference/training always starts with such a random vector, and if there are no known words, you just get that back.)
Still, this may be worth trying, and you can directly compare via cosine-similarity the inferred vectors from tiny and giant documents alike.
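For example, with a trained Gensim Doc2Vec model in d2v_model (the variable names, example words, and long_document_tokens here are just placeholders):

import numpy as np

def cossim(a, b):
    # plain cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = d2v_model.infer_vector(['heart', 'attack'])   # tiny 1-3 word text
doc_vec = d2v_model.infer_vector(long_document_tokens)    # a full document's tokens
print(cossim(query_vec, doc_vec))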
Many Doc2Vec modes train both doc-vectors and compatible word-vectors. The default PV-DM mode (dm=1) does this, or PV-DBOW (dm=0) if you add the optional interleaved word-vector training (dbow_words=1). (If you use dm=0, dbow_words=0, you'll get fast training, and often quite-good doc-vectors, but the word-vectors won't have been trained at all - so you wouldn't want to look up such a model's word-vectors directly for any purposes.)
With such a Doc2Vec model that includes valid word-vectors, you could also analyze your short 1-3 word docs via their individual words' vectors. You might check each word individually against a full document's vector, or use the average of the short document's words against a full document's vector.
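A sketch of that averaging option, reusing cossim() from the sketch above, assuming a model trained in a mode with real word-vectors (some_tag is a placeholder for a document tag used during training; in Gensim 4.x the doc-vectors live in d2v_model.dv, in older versions in docvecs):

import numpy as np

query_words = ['heart', 'attack']
word_vecs = [d2v_model.wv[w] for w in query_words if w in d2v_model.wv]
query_avg = np.mean(word_vecs, axis=0)      # average of the short query's word-vectors

doc_vec = d2v_model.dv[some_tag]            # the trained doc-vector for one full document
print(cossim(query_avg, doc_vec))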
Again, which is best will likely depend on other particulars of your need. For example, if the short doc is a query, and you're listing multiple results, it may be the case that query result variety – via showing some hits that are really close to single words in the query, even when not close to the full query – is as valuable to users as documents close to the full query.
Another measure worth looking at is "Word Mover's Distance", which works just with the word-vectors for a text's words, as if they were "piles of meaning" for longer texts. It's a bit like the word-against-every-word approach you entertained – but working to match words with their nearest analogues in a comparison text. It can be quite expensive to calculate (especially on longer texts) – but can sometimes give impressive results in correlating alternate texts that use varied words to similar effect.
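In Gensim that's available on the word-vectors as wmdistance(); a minimal sketch, reusing the placeholder names above (note it returns a distance, where lower means more similar, and it needs an optimal-transport dependency – pyemd or POT, depending on your Gensim version – installed):

distance = d2v_model.wv.wmdistance(['heart', 'attack'], long_document_tokens)
print(distance)   # a distance, not a similarity: lower = closer in meaning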
I'm playing with gensim's word2vec and am trying to build a model using the terms from a large medical thesaurus as sentences. There are about 1 million terms (most of them multiword terms, which I treat as sentences), and the hope is that if word2vec sees terms like "breast cancer" and "breast tumor" etc. it will be able to conclude that "cancer" and "tumor" are somewhat similar.
I run experiments in which I track how similar such terms are when using different numbers of iterations, but it seems that the results don't correlate. I'd expect that for word pairs like (wound, lesion), (thorax, lung), (cancer, tumor) etc., when going from 5 to 100 iterations, there'd be a tendency (even if small) for the one word in the pair to become "more similar" to the other as the number of iterations grows. But no, results appear pretty random or even get worse.
Specifically: I loop with 1, 5, 10, 20, 50, 100 iterations, train a w2v model each time, and then for my word pairs above check the rank of the 2nd word (say "lung") in the list of similar words (as returned by w2v) for the first word (say "thorax"), then sum up and build the average. And the average rank is growing (!) not decreasing, meaning that as training proceeds, the vectors for "lung" and "thorax" move further and further away from each other.
I didn't expect gensim to detect the clean synonyms and also perhaps 'only' 1 million terms (sentences) is not enough, but still I am puzzled by this effect.
Does anyone have a suspicion?
====================================================
Added after comments and feedback came in:
Thanks for the detailed feedback, gojomo. I had checked many of these issues before:
yes, the thesaurus terms ("sentences") come in the right format, e.g. ['breast', 'cancer']
yes, of the ~1 million terms, more than 850,000 are multiword. It's clear that 1-word terms won't provide any context, but there should be ample evidence from the multiword terms
the examples I gave ('clinic', 'cancer', 'lung', ...) occur in many hundreds of terms, often many thousands. This is what I find odd: that even for words this frequent, no really good similar words are suggested.
you ask for the code: Here it is https://www.dropbox.com/s/fo3fazl6frj99ut/w2vexperiment.py?dl=0 It expects to be called (python3) with the name of the model and then the SKOS-XML files of a large thesaurus like Snomed
python w2vexperiment.py snomed-w2v.model SKOS/*.skos
In the code you can see that I create a new model with each new experiment (with a different number of iterations), so there should be no effect of one run polluting another (wrong learning rate etc...)
I have set min_count to 10
Still: the models don't get better but often worse as number of iterations grows. And even the better ones (5 or 10 iterations) give me strange results for my test words...
I suspect there's something wrong with your corpus prep, or training – usually word2vec can rank such similarities well.
Are you supplying the terms alone (e.g. ['breast', 'tumor'] or ['prophylaxis'] as very tiny sentences), or the terms plus definitions/synonyms as somewhat longer sentences?
The latter would be better.
If the former, then 1-word 'sentences' achieve nothing: there's no neighboring 'context' for word2vec to learn anything, and they're essentially skipped.
And mere 2-word sentences might get some effect, but don't necessarily provide the kind of diverse contexts helpful for training to induce the useful vector arrangements.
Also, if it's 1-million 'sentences' of just 1-4 words each, it's kind of a small dataset, and individual words might not be appearing often enough, in sufficiently varied contexts, for them to get good vectors. You should check the words/tokens of interest, in the model.wv.vocab dict, for a count value that indicates there were enough examples to induce a good vector – ideally 10+ occurrences each (and more is better).
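For example, a quick check of those counts (using the Gensim 3.x attribute mentioned above; in Gensim 4.x the equivalent is model.wv.get_vecattr(word, 'count')):

for word in ['cancer', 'tumor', 'thorax', 'lung']:
    if word in model.wv.vocab:
        print(word, model.wv.vocab[word].count)   # training occurrences that survived min_count
    else:
        print(word, 'not in model')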
So: more data, and more diverse usages from the relevant domain, are always a good idea. A thesaurus with synonyms in each 'sentence', that are many words (5 to dozens), might be enough.
You don't show your code or training-parameters, but people tweaking the defaults, or following outdated online examples, can often sabotage the algorithm's effectiveness.
For example, it's distressingly common to see people who call train() multiple times, in their own iteration loop, to mismanage the learning-rate alpha such that some iterations run with a negative alpha – meaning every backpropagation serves to drive the context-vectors towards lower target-word predictiveness, the exact opposite of what should be happening. (It's best to either supply the corpus & iter on Word2Vec initialization, or call train() just once. Only advanced tinkerers should need to call train() multiple times.)
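A sketch of the safer single-call pattern (Gensim 4.x parameter names; in older versions epochs was iter and vector_size was size; thesaurus_sentences is a placeholder for your tokenized corpus):

from gensim.models import Word2Vec

# supplying the corpus & number of passes up-front lets Gensim manage the alpha decay itself
model = Word2Vec(sentences=thesaurus_sentences, vector_size=100, min_count=10,
                 epochs=20, workers=4)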
Similarly, while the naive intuition is often "keeping more words/info must be better", and thus people lower min_count to 1 or 0, such low-frequency words can't get good vectors with just 1 (or a few) occurrences, but since they are very numerous (in total), they can interfere with other words' meaningful training. (The surviving, more-frequent words get better vectors when low-frequency words are discarded.)
Good luck!