Gensim's Word2Vec model takes as input a list of lists, with each inner list containing the individual tokens/words of a sentence. As I understand it, Word2Vec is used to "quantify" the context of words within a text using vectors.
I am currently dealing with a corpus of text that has already been split into individual tokens and no longer contains an obvious sentence format (punctuation has been removed). I was wondering how should I input this into the Word2Vec model?
Say if I simply split the corpus into "sentences" of uniform length (10 tokens per sentence for example), would this be a good way of inputting the data into the model?
Essentially, I am wondering how the format of the input sentences (list of lists) affects the output of Word2Vec?
That sounds like a reasonable solution. If you have access to data that is similar to your cleaned data you could get average sentence length from that data set. Otherwise, you could find other data in the language you are working with (from wikipedia or another source) and get average sentence length from there.
Of course your output vectors will not be as reliable as if you had the correct sentence boundaries, but it sounds like word order was preserved so there shouldn't be too much noise from incorrect sentence boundaries.
Most typically, text is passed to Word2Vec in logical units (like sentences or paragraphs). Also, the published papers and early demo code tended to convert punctuation into tokens, as well.
But text without punctuation, and arbitrary breaks between texts, are a reasonable workaround and still give pretty good results.
For example, the text8/text9 corpuses often used in demos (including the word2vec intro Jupyter notebook bundled in gensim) are just giant runs-of-words, lacking punctuation and line-breaks. So, the utility LineSentence class used in gensim will break them into individual 10,000-token texts.
It's probably better to go larger in your arbitrary breaks (eg 10,000), rather than smaller (eg 10), for a couple reasons:
source texts are usually longer than 10 words
often the source material that was run-together was still semantically-related across its original boundaries
the optimized algorithms work better on larger chunks of data
the harm of "false context windows" (created by the concatenation) is probably just noise with no net biasing effect, while more "true windows" (by creating as few false splits as possible) likely retains more of the original corpus' learnable word-to-word relationships signal
you can always simulate more-conservative contexts with a smaller window parameter (if the original source really did have tiny sentences that weren't sequentially-related)
But, gensim's cython-optimized training path has an implementation limit of 10,000 tokens per text – with any more being silently ignored – so you wouldn't intentionally want to supply longer texts for any reason.
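As a rough illustration of the chunking described above, here is a minimal sketch that splits one flat token list into texts of at most 10,000 tokens. The helper name chunk_tokens and the sample tokens are made up for illustration; the resulting list of lists is the shape Word2Vec takes as its sentences input.

```python
def chunk_tokens(tokens, max_len=10_000):
    """Split a flat list of tokens into consecutive chunks of at most max_len tokens."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# Hypothetical corpus: one long run of tokens with no sentence boundaries.
corpus_tokens = ["tok%d" % i for i in range(25_000)]
texts = chunk_tokens(corpus_tokens)
print([len(t) for t in texts])  # [10000, 10000, 5000]
```

Only the final chunk is shorter than the limit, so the number of artificial breaks stays as small as possible.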
I need to train a model in Python based on word2vec or other models in order to get adjectives which are semantically close to a word.
For example give a word like 'cat' to model and receive adjectives like 'cute', 'nice', etc.
Is there any way?
With any word2vec model – whether you train it on your own data, or download someone else's pre-trained model – you can give it a word like cat and receive back a ranked list of words that are considered 'similar' in its coordinate system.
However, these won't normally be limited to adjectives, as typical word2vec models don't take any note of a word's part-of-speech. So to filter to just adjectives, some options could include:
use a typical word2vec set-of-vectors that is oblivious to part-of-speech, but use some external reference (like say WordNet) to check each returned word, and discard those that can't be adjectives
preprocess a suitable training corpus to label words with their part-of-speech before word2vec training, as is sometimes done. Then your model's tokens will include within them a declared part-of-speech. For example, you'd then no longer have the word good alone as a token, but (depending on what conventions you use) tagged-tokens like good/NOUN & good/ADJ instead. Then, filtering the closest-words to just adjectives is a simple matter of checking for the desired string pattern.
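For the tagged-token option, the filtering step might look like the sketch below. It assumes a word/POS token convention such as cute/ADJ, and the neighbour list is made up; it only mimics the (token, similarity) pairs that most_similar() returns. The helper name keep_pos is hypothetical.

```python
def keep_pos(neighbours, pos="ADJ", sep="/"):
    """Keep only (token, score) pairs whose token ends with sep + pos, e.g. 'cute/ADJ'.

    The tag suffix is stripped from the tokens that are kept.
    """
    suffix = sep + pos
    return [
        (token[:-len(suffix)], score)
        for token, score in neighbours
        if token.endswith(suffix)
    ]


# Made-up neighbours of 'cat/NOUN', in the shape most_similar() returns.
neighbours = [
    ("dog/NOUN", 0.81),
    ("cute/ADJ", 0.74),
    ("purr/VERB", 0.70),
    ("fluffy/ADJ", 0.66),
]
adjectives = keep_pos(neighbours)
print(adjectives)  # [('cute', 0.74), ('fluffy', 0.66)]
```

Since most of the nearest neighbours may get filtered out, you would probably want to request a generous topn and then trim the filtered list.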
However, the words you receive from any such process based on word2vec might not be precisely what you're looking for. The kinds of 'semantic similarity' captured by word2vec coordinates are driven by how well words predict other nearby words under the model's limitations. Whether these will meet your needs is something you'll have to try; there could be surprises.
For example, words that humans consider antonyms, like hot & cold, will still be relatively close to each other in word2vec models, as they both describe the same aspect of something (its temperature), and often appear in the same surrounding-word contexts.
And, depending on training texts & model training parameters, different word2vec models can sometimes emphasize different kinds of similarity in their rankings. Some have suggested, for example, that using a smaller window can tend to place words that are direct replacements for each other (same syntactic roles) closer together, whereas a larger window will somewhat more bring together words used in the same topical domains (even if they aren't of the same type). Which kind of similarity would be better for your need? I'm not sure; if you have the time/resources, you could compare the quality of results from multiple contrasting models.
I built a box-embedding model on the latest Wikipedia articles dump and I need to compare it with the word2vec model in gensim. When I generate the corpus as a txt file using the get_texts() method of the WikiCorpus class, it contains a lot of stop words, which makes me think that WikiCorpus doesn't remove stop words. Is that right? After training my box model on that wiki corpus txt, I noticed that the "most similar" function I wrote specifically for box embeddings very often returns stop words, whereas passing the same word to the most_similar function of a word2vec model trained on the same corpus txt produces much better results. Can someone suggest why the word2vec model works so well even though the corpus contains a lot of stop words, while my box model trained on the same corpus does not?
How did you train a box-embedding, and why did you think it would offer good most_similar() results?
From a (very) quick glance at the 'BoxE' paper by Abboud et al (2020), it appears to require training based on a knowledge base representation – not the raw text that'd come from WikiCorpus. (So I'd not necessarily expect a properly-trained BoxE embedding would have 'stop words' in it at all.)
And, BoxE appears to be optimized for evaluating potential facts – not more general most_similar rankings. So I'd not expect a simple most_similar listing from it to necessarily be expressive.
In usual word2vec, removing stop-words isn't very important and plenty of published work doesn't bother doing so. The downsampling of highly-frequent words already tends to ignore many stop-word occurrences – and their highly diverse usage contexts mean they are likely to get weak word-vectors not especially close to other more-narrow-meaning word-vectors.
So in usual word2vec, stop-words aren't very likely to be in the top-neighbors, by cosine-similarity, of other more-specific entity words.
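To get a feel for how strongly frequent-word downsampling thins out stop-word occurrences, here is a small sketch. It assumes the keep-probability formula from the original word2vec.c code; whether a given gensim version uses exactly this formula for its sample parameter is an assumption here, and the counts are invented.

```python
import math


def keep_probability(word_count, total_words, sample=1e-3):
    """Probability of keeping one occurrence of a word under frequent-word downsampling.

    Assumed formula (from word2vec.c): (sqrt(f / sample) + 1) * (sample / f),
    where f is the word's share of all corpus tokens, capped at 1.0.
    """
    freq = word_count / total_words
    p = (math.sqrt(freq / sample) + 1) * (sample / freq)
    return min(1.0, p)


total = 1_000_000
print(round(keep_probability(50_000, total), 3))  # stop-word-like token, 5% of the corpus
print(keep_probability(100, total))               # rare token, always kept
```

Under these assumptions, a token making up 5% of the corpus keeps only about one occurrence in six, while the rare token is never dropped.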
I'd like to compare the difference among the same word mentioned in different sentences, for example "travel".
What I would like to do is:
Take the sentences mentioning the term "travel" as plain text;
In each sentence, replace 'travel' with travel_sent_x.
Train a word2vec model on these sentences.
Calculate the distance between travel_sent1, travel_sent2, and other relabelled mentions of "travel"
So each sentence's "travel" gets its own vector, which is used for comparison.
I know that word2vec requires much more than several sentences to train reliable vectors. The official page recommends datasets containing billions of words, but my dataset is nowhere near that size (I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
My approach to build the vectors has been:
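from gensim.models import Word2Vec

# df is a pandas DataFrame holding one sentence per row in the 'Sentences' column
vocab = [sentence.split() for sentence in df['Sentences']]  # list of token lists
# gensim >= 4.0 names this parameter vector_size; older releases called it size
model = Word2Vec(sentences=vocab, vector_size=100, window=10, min_count=3, workers=4, sg=0)
# Word2Vec has no vectorize() method; per-word vectors are looked up as model.wv[word]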
However I do not know how to visualise the results to see their similarity and get some useful insight.
Any help and advice will be welcome.
Update: I would use Principal Component Analysis to visualise the embeddings in 3-dimensional space. I know how to do this for individual words, but I do not know how to do it for sentences.
Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
One quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all the word-vectors together. It's fast, it's better-than-nothing, and does ok on some simple (broadly-topical) tasks - but isn't going to capture the full meaning of a text very well, especially any meaning which is dependent on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
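A minimal sketch of that averaging approach follows, using tiny made-up 2-dimensional vectors in a plain dict. With a trained model, the dict would presumably be replaced by the model's wv object, which supports the same membership test and lookup; the helper names are hypothetical.

```python
import math


def average_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens to average")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
vectors = {
    "hawaii": [1.0, 0.0],
    "honolulu": [0.9, 0.1],
    "travel": [0.0, 1.0],
    "expenses": [-1.0, 0.5],
}
s1 = average_vector("hawaii boosts travel".split(), vectors)
s2 = average_vector("honolulu boosts travel".split(), vectors)
s3 = average_vector("travel expenses".split(), vectors)
print(cosine(s1, s2) > cosine(s1, s3))  # True
```

The per-sentence vectors (s1, s2, s3 here) are what you would then pass to PCA, t-SNE, or a pairwise similarity matrix.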
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is not very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more-random) quality than your other word-vectors, having received comparatively little training attention compared to other words, and each being fully influenced by just their few surrounding words (rather than the entire range of all surrounding contexts that could all help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
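If you do want to try the two mitigations in the parenthetical above, the corpus preparation could look roughly like this. The helper name relabel_corpus, the repeats value, and the fixed seed are all illustrative choices, not recommendations.

```python
import random


def relabel_corpus(sentences, target="travel", repeats=3, seed=0):
    """Return the original sentences plus repeated, shuffled relabelled copies.

    In each relabelled copy, every occurrence of `target` is replaced by a
    per-sentence token such as 'travel_sent_1'.
    """
    relabelled = []
    for idx, tokens in enumerate(sentences, start=1):
        if target in tokens:
            tag = f"{target}_sent_{idx}"
            relabelled.append([tag if t == target else t for t in tokens])
    # (1) keep the untouched originals, (2) repeat the mangled versions
    corpus = [list(s) for s in sentences] + relabelled * repeats
    # shuffle so the synthetic contexts are spread throughout the corpus
    random.Random(seed).shuffle(corpus)
    return corpus


sentences = [
    "hawaii makes a move to boost domestic travel and support local tourism".split(),
    "honolulu makes a move to boost travel and support local tourism".split(),
]
corpus = relabel_corpus(sentences)
print(len(corpus))  # 8: 2 originals + 2 relabelled sentences x 3 repeats
```

Any model trained on such a corpus would still need min_count low enough to keep the synthetic tokens, with the quality caveats described above.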
Another possible way to compare travel_sent_A, travel_sent_B, etc would be to ignore the exact vector for travel or travel_sent_X entirely, but instead compile a summary vector for the word's surrounding N words. For example if you have 100 examples of the word travel, create 100 vectors that are each of the N words around travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very-different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context vector approach to influence/choose among alternate word-senses.)
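The first step of that context-vector idea, collecting the N words around each occurrence, can be sketched as below. The helper name context_windows is made up, and each returned window would still need to be turned into a summary vector (for example by averaging its word-vectors).

```python
def context_windows(tokens, target, n=2):
    """Return, for each occurrence of target, the up-to-n tokens on each side of it."""
    windows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - n):i]
            right = tokens[i + 1:i + 1 + n]
            windows.append(left + right)
    return windows


tokens = "a move to boost domestic travel and support local tourism".split()
print(context_windows(tokens, "travel"))
# [['boost', 'domestic', 'and', 'support']]
```

The target word itself is left out of each window, so the comparison rests only on the surrounding words.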
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVector classes as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover’s Distance
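To give a feel for the "cost to transform one bag into another" idea without any dependencies, here is a sketch of a simplified variant. To be clear, this is not the exact Word Mover's Distance that wmdistance() computes (that requires solving an optimal-transport problem). It is the cheaper "relaxed" lower bound, in which every word sends all of its weight to its nearest counterpart in the other text. The toy vectors are invented and every token is assumed to have a vector.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_sided_cost(doc_a, doc_b, vectors):
    """Each word of doc_a sends all of its weight to the closest word of doc_b."""
    weights = Counter(doc_a)
    total = sum(weights.values())
    targets = set(doc_b)
    return sum(
        (count / total) * min(euclidean(vectors[w], vectors[v]) for v in targets)
        for w, count in weights.items()
    )


def relaxed_wmd(doc_a, doc_b, vectors):
    """Lower bound on Word Mover's Distance: the larger of the two one-sided costs."""
    return max(
        one_sided_cost(doc_a, doc_b, vectors),
        one_sided_cost(doc_b, doc_a, vectors),
    )


# Toy vectors, invented for illustration only.
vectors = {
    "boost": [1.0, 0.0],
    "increase": [0.9, 0.1],
    "travel": [0.0, 1.0],
    "tourism": [0.1, 0.9],
    "taxes": [-1.0, -1.0],
}
near = relaxed_wmd(["boost", "travel"], ["increase", "tourism"], vectors)
far = relaxed_wmd(["boost", "travel"], ["increase", "taxes"], vectors)
print(near < far)  # True
```

The pair that differs only by near-synonyms comes out closer than the pair where one word has no good counterpart, which is the behaviour the full calculation is meant to capture more precisely.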
If you are interested in comparing sentences, Word2Vec is not the best choice. Using it to create sentence embeddings has been shown to produce results inferior to those of dedicated sentence-embedding algorithms. If your dataset is not huge, you can't train a useful new embedding space on your own data. This forces you to use a pre-trained embedding for the sentences. Luckily, there are enough of those nowadays. I believe that Universal Sentence Encoder (by Google) will suit your needs best.
Once you get vector representations for your sentences you can go 2 ways:
create a matrix of pairwise comparisons and visualize it as a heatmap. This representation is useful when you have some prior knowledge about how close the sentences are and you want to check your hypothesis. You can even try it online.
run t-SNE on the vector representations. This will create a 2D projection of the sentences that tries to keep similar sentences close together. It often shows clusters more clearly than PCA. Then you can easily find the neighbors of a given sentence.
You can learn more from this and this
Interesting take on the word2vec model. You can compute t-SNE embeddings of the vectors to reduce the dimensionality to 3, and visualise them using any plotting library such as matplotlib or dash. I also find this tool helpful when visualising word embeddings: https://projector.tensorflow.org/
The idea of learning different word embeddings for words in different contexts is the premise of ELMo (https://allennlp.org/elmo), but you will require a huge training set to train it. Luckily, if your application is not very specific, you can use pre-trained models.
I have a text file with millions of rows which I want to convert into word vectors, so that later on I can compare these vectors with a search keyword and see which texts are closest to it.
My dilemma is that all the training files I have seen for Word2vec are in the form of paragraphs, so that each word has some contextual meaning within that file. My file, however, has independent rows, each containing different keywords.
My question is whether it is possible to create word embeddings using this text file and, if not, what the best approach is for finding matches for a search keyword among these millions of texts.
My File Structure:
Walmart
Home Depot
Home Depot
Sears
Walmart
Sams Club
GreenMile
Walgreen
Expected
search Text : 'WAL'
Result from My File:
WALGREEN
WALMART
WALMART
Embeddings
Let's step back and understand what word2vec is. Word2vec (like GloVe, FastText etc.) is a way to represent words as vectors. ML models don't understand words, they only understand numbers, so when we are dealing with words we want to convert them into numbers (vectors). One-hot encoding is one naive way of encoding words as vectors. But for a large vocabulary, one-hot encoded vectors become too long. Also, there is no semantic relationship between one-hot encoded words.
With deep learning came the distributed representation of words (called word embeddings). One important property of these word embeddings is that the vector distance between related words is small compared to the distance between unrelated words, i.e. distance(apple, orange) < distance(apple, cat)
So how are these embedding models trained? They are trained on a (very) large corpus of text. Given such a corpus, the model will learn that apple and orange are used (many times) in the same contexts, and so that apple and orange are related. So to train a good embedding model you need a huge corpus of text (not independent words, because independent words have no context).
However, one rarely trains a word embedding model from scratch, because good embedding models are available as open source. If your text is domain specific (say medical), you can instead do transfer learning on openly available word embeddings.
Out of vocabulary (OOV) words
Word embeddings like word2vec and GloVe cannot return an embedding for OOV words. However, embeddings like FastText (thanks to #gojom for pointing it out) handle OOV words by breaking them into character n-grams and building a vector by summing the subword vectors that make up the word.
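To make the n-gram idea concrete, here is a small sketch of how a word can be broken into character n-grams with boundary markers, in the style described for FastText. The function name and the choice of n=3 in the example are illustrative; the real library also hashes these n-grams into buckets and keeps a vector for the whole word, which is not shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


print(char_ngrams("wal", min_n=3, max_n=3))  # ['<wa', 'wal', 'al>']

# An unseen fragment can borrow from any n-grams it shares with a trained word.
shared = set(char_ngrams("wal", 3, 3)) & set(char_ngrams("walmart", 3, 3))
print(sorted(shared))  # ['<wa', 'wal']
```

The OOV vector is then built from the vectors of whichever of these n-grams were seen in training, so its quality depends on how much overlap there is.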
Problem
Coming to your problem,
Case 1: Let's say the user enters the word WAL. First of all, it is not a valid English word, so it will not be in the vocabulary and it is hard to find a meaningful vector for it. Embeddings like FastText handle such words by breaking them into n-grams. This approach gives good embeddings for misspelled words or slang.
Case 2: Let's say the user enters the word WALL. If you plan to use vector similarity to find the closest word, it will never be close to Walmart because semantically they are not related. It will rather be close to words like window, paint, door.
Conclusion
If your search is for semantically similar words, then a solution using vector embeddings will be good. On the other hand, if your search is lexical (based on the characters in the words), then vector embeddings will be of no help.
If you wanted to find walmart from a fragment like wal, you'd more likely use something like:
a substring or prefix search through all entries; or
a reverse-index-of-character-n-grams; or
some sort of edit-distance calculated against all entries or a subset of likely candidates
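Two of the options above can be sketched with just the standard library, using the sample rows from the question. The prefix search reproduces the expected output for 'WAL'. For the fuzzy case, difflib's similarity ratio is used here as a stand-in for a true edit-distance, which is a substitution on my part rather than the same measure.

```python
import difflib

entries = ["Walmart", "Home Depot", "Home Depot", "Sears",
           "Walmart", "Sams Club", "GreenMile", "Walgreen"]


def prefix_search(query, entries):
    """Case-insensitive prefix match, returned upper-cased and sorted."""
    q = query.casefold()
    return sorted(e.upper() for e in entries if e.casefold().startswith(q))


print(prefix_search("WAL", entries))  # ['WALGREEN', 'WALMART', 'WALMART']

# Fuzzy lookup for a misspelling, against the distinct lower-cased names.
names = sorted({e.casefold() for e in entries})
close = difflib.get_close_matches("wallmart", names, n=1)
print(close)  # ['walmart']
```

A linear scan like this is fine for a sketch; for millions of rows you would more likely want a sorted list with binary search, a trie, or the n-gram index mentioned above.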
That is, from your example desired output, this is not really a job for word-vectors, even though some algorithms, like FastText, will be able to provide rough vectors for word-fragments based on their overlap with trained words.
If in fact you want to find similar stores, word-vectors might theoretically be useful. But the problem given your example input is that such word-vector algorithms require examples of tokens used in context, from sequences-of-tokens that co-appear in natural-language-like relationships. And you want lots of data featuring varied examples-in-context, to capture subtle gradations of mutual relationships.
While your existing single-column of short entity-names (stores) can't provide that, maybe you have something applicable elsewhere, if you have richer data sources. Some ideas might be:
lists of stores visited by a single customer
lists of stores carrying the same product/UPC
text from a much larger corpus (such as web-crawled text, or maybe Wikipedia) in which there are sufficient in-context usages of each store-name. (You'd just throw out all the other words created from such training - but the vectors for your tokens-of-interest might still be of use in your domain.)
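As an illustration of the first idea, per-customer store lists could be turned into the list-of-token-lists shape that word-vector training expects. The visit records below are entirely hypothetical, as is the helper name; joining multi-word names with underscores is just one way to keep each store as a single token.

```python
from collections import defaultdict


def visits_to_sentences(visits):
    """Group (customer_id, store) pairs into one store list per customer, in visit order."""
    by_customer = defaultdict(list)
    for customer_id, store in visits:
        by_customer[customer_id].append(store.lower().replace(" ", "_"))
    return list(by_customer.values())


# Hypothetical visit log, invented for illustration.
visits = [
    ("c1", "Walmart"),
    ("c2", "Home Depot"),
    ("c1", "Sams Club"),
    ("c2", "Sears"),
    ("c1", "Walgreen"),
]
print(visits_to_sentences(visits))
# [['walmart', 'sams_club', 'walgreen'], ['home_depot', 'sears']]
```

Each inner list then plays the role of a "sentence", so stores that tend to be visited by the same customers would share contexts.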
In gensim I have a trained doc2vec model, if I have a document and either a single word or two-three words, what would be the best way to calculate the similarity of the words to the document?
Do I just do the standard cosine similarity between them as if they were 2 documents? Or is there a better approach for comparing small strings to documents?
My first thought was to take the cosine similarity between each word in the 1-3 word string and every word in the document, then average the results, but I don't know how effective this would be.
There's a number of possible approaches, and what's best will likely depend on the kind/quality of your training data and ultimate goals.
With any Doc2Vec model, you can infer a vector for a new text that contains known words – even a single-word text – via the infer_vector() method. However, like Doc2Vec in general, this tends to work better with documents of at least dozens, and preferably hundreds, of words. (Tiny 1-3 word documents seem especially likely to get somewhat peculiar/extreme inferred-vectors, especially if the model/training-data was underpowered to begin with.)
Beware that unknown words are ignored by infer_vector(), so if you feed it a 3-word document for which two words are unknown, it's really just inferring based on the one known word. And if you feed it only unknown words, it will return a random, mild initialization vector that's undergone no inference tuning. (All inference/training always starts with such a random vector, and if there are no known words, you just get that back.)
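Because of this, it may be worth checking how much of a short query the model actually knows before trusting its inferred vector. The sketch below uses a plain set as a stand-in vocabulary; with a real model you would test membership against the model's own vocabulary instead (in recent gensim versions that is presumably model.wv.key_to_index, but treat that as an assumption). The helper name is made up.

```python
def split_known(tokens, vocab):
    """Separate tokens into those present in vocab and those that would be ignored."""
    known = [t for t in tokens if t in vocab]
    unknown = [t for t in tokens if t not in vocab]
    return known, unknown


# Stand-in vocabulary, invented for illustration.
vocab = {"hawaii", "travel", "tourism"}
known, unknown = split_known(["cheap", "travel", "deals"], vocab)
print(known, unknown)  # ['travel'] ['cheap', 'deals']

if not known:
    print("no known words: an inferred vector would be essentially random")
```

Here only one of the three query words would contribute to inference, which is exactly the situation described above.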
Still, this may be worth trying, and you can directly compare via cosine-similarity the inferred vectors from tiny and giant documents alike.
Many Doc2Vec modes train both doc-vectors and compatible word-vectors. The default PV-DM mode (dm=1) does this, or PV-DBOW (dm=0) if you add the optional interleaved word-vector training (dbow_words=1). (If you use dm=0, dbow_words=0, you'll get fast training, and often quite-good doc-vectors, but the word-vectors won't have been trained at all - so you wouldn't want to look up such a model's word-vectors directly for any purposes.)
With such a Doc2Vec model that includes valid word-vectors, you could also analyze your short 1-3 word docs via their individual words' vectors. You might check each word individually against a full document's vector, or use the average of the short document's words against a full document's vector.
Again, which is best will likely depend on other particulars of your need. For example, if the short doc is a query, and you're listing multiple results, it may be the case that query result variety – via showing some hits that are really close to single words in the query, even when not close to the full query – is as valuable to users as documents close to the full query.
Another measure worth looking at is "Word Mover's Distance", which works just with the word-vectors for a text's words, as if they were "piles of meaning" for longer texts. It's a bit like the word-against-every-word approach you entertained – but working to match words with their nearest analogues in a comparison text. It can be quite expensive to calculate (especially on longer texts) – but can sometimes give impressive results in correlating alternate texts that use varied words to similar effect.