First, I want to explain my task. I have a dataset of 300k documents with an average of 560 words (no stop word removal yet); 75% are in German, 15% in English, and the rest in other languages. The goal is to recommend similar documents based on an existing one. At the beginning I want to focus on the German and English documents.
To achieve this goal I looked into several feature-extraction methods for document similarity. The word embedding methods have impressed me in particular, because they are context aware, in contrast to simple TF-IDF feature extraction with cosine similarity.
I'm overwhelmed by the number of methods I could use, and I haven't found a proper evaluation of them yet. I know for sure that my documents are too long for BERT, but there are FastText, Sent2Vec, Doc2Vec and the Universal Sentence Encoder from Google. My favorite method based on my research is Doc2Vec, even though the available pre-trained models are either missing or outdated, which means I would have to do the training on my own.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on. Would I achieve good results if I train the model on the English/German Wikipedia instead?
You really have to try the different methods on your data, with your specific user tasks, with your time/resources budget to know which makes sense.
Your 225K German documents and 45K English documents are each plausibly large enough to use Doc2Vec, as they match or exceed the corpus sizes in some published results. So you wouldn't necessarily need to train on something else (like Wikipedia) instead, and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
(There might be special challenges in German given compound words using common-enough roots but being individually rare, I'm not sure. FastText-based approaches that use word-fragments might be helpful, but I don't know a Doc2Vec-like algorithm that necessarily uses that same char-ngrams trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known-labels to bootstrap better text vectors - but that's highly speculative and that mode isn't supported in Gensim.)
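(If you do go the Doc2Vec route on your own corpus, a minimal Gensim training sketch could look like the following. The tokenized texts iterable and every parameter value are assumptions you'd tune on your data.)

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# texts: an iterable of already-tokenized documents (lists of word strings) - an assumption
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(texts)]

# Default PV-DM mode; vector_size/min_count/epochs are starting points, not tuned values
model = Doc2Vec(vector_size=300, window=5, min_count=5, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

vec_new = model.infer_vector(['ein', 'neues', 'dokument'])   # vector for an unseen document
similar = model.dv.most_similar([vec_new], topn=10)          # most similar training documents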
Is there a pretrained Gensim Phrases model? If not, would it be possible to reverse engineer a pretrained word embedding and create a phrase model from it?
I am trying to use GoogleNews-vectors-negative300.bin with Gensim's Word2Vec. First, I need to map my words into phrases so that I can look up their vectors from the Google's pretrained embedding.
I searched the official Gensim documentation but could not find any info. Thanks!
I'm not aware of anyone sharing a Phrases model. Any such model would be very sensitive to the preprocessing/tokenization step and the specific parameters the creator used.
Other than the high-level algorithm description, Google's exact choices for the tokenization/canonicalization/phrase-combination applied to the data that fed the GoogleNews 2013 word-vectors don't appear to have been documented anywhere. Some guesses about the preprocessing can be made by reviewing the tokens present, but I'm unaware of any code to apply similar choices to other text.
You could try to mimic their unigram tokenization, then speculatively combine strings of unigrams into ever-longer multigrams up to some maximum, check whether those combinations are present, and when not present, revert to the unigrams (or the largest combination present). This might be expensive if done naively, but could be amenable to optimizations if really important - especially for some subset of the more-frequent words - as the GoogleNews set appears to follow the convention of listing words in descending frequency.
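A rough, unoptimized sketch of that greedy combine-then-fall-back idea, assuming the vectors are loaded with Gensim, that '_' is the joining character (as the released file appears to use), and a maximum phrase length of 4:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def combine_tokens(tokens, kv, max_len=4):
    """Greedily merge runs of unigrams into the longest multigram known to kv."""
    out, i = [], 0
    while i < len(tokens):
        match, match_len = tokens[i], 1
        # try ever-longer combinations; keep the longest one present in the vocabulary
        for j in range(i + 2, min(i + max_len, len(tokens)) + 1):
            candidate = '_'.join(tokens[i:j])
            if candidate in kv.key_to_index:
                match, match_len = candidate, j - i
        out.append(match)
        i += match_len
    return out

# e.g. might yield ['New_York_City', 'mayor'] if that phrase token exists in the vectors
print(combine_tokens(['New', 'York', 'City', 'mayor'], kv))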
(In general, though it's a quick & easy starting set of word-vectors, I think GoogleNews is a bit over-relied upon. It will lack words/phrases and new senses that have developed since 2013, and any meanings it does capture are determined by news articles in the years leading up to 2013... which may not match the dominant senses of words in other domains. If your domain isn't specifically news, and you have sufficient data, deciding your own domain-specific tokenization/combination will likely perform better.)
I'd like to compare the difference among the same word mentioned in different sentences, for example "travel".
What I would like to do is:
Take the sentences mentioning the term "travel" as plain text;
In each sentence, replace 'travel' with travel_sent_x.
Train a word2vec model on these sentences.
Calculate the distance between travel_sent1, travel_sent2, and other relabelled mentions of "travel"
So each sentence's "travel" gets its own vector, which is used for comparison.
I know that word2vec requires much more than several sentences to train reliable vectors. The official page recommends datasets of billions of words, but I don't have anywhere near that in my dataset (I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
My approach to build the vectors has been:
from gensim.models import Word2Vec
# Tokenize each sentence into a list of word tokens
vocab = [sentence.lower().split() for sentence in df['Sentences']]
model = Word2Vec(sentences=vocab, vector_size=100, window=10, min_count=3, workers=4, sg=0)
# Look up the trained vector for an individual word, e.g. model.wv['travel']
However I do not know how to visualise the results to see their similarity and get some useful insight.
Any help and advice will be welcome.
Update: I would use the Principal Component Analysis (PCA) algorithm to visualise the embeddings in 3-dimensional space. I know how to do it for each individual word, but I do not know how to do it for sentences.
Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
One quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all the word-vectors together. It's fast, it's better-than-nothing, and does ok on some simple (broadly-topical) tasks - but isn't going to capture the full meaning of a text very well, especially any meaning which is dependent on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
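For illustration, a small sketch of that averaging approach, assuming a trained Gensim model like the one in the question (words the model doesn't know are simply skipped):

import numpy as np

def average_vector(tokens, wv):
    """Average the vectors of all in-vocabulary tokens; zero vector if none are known."""
    known = [wv[t] for t in tokens if t in wv.key_to_index]
    return np.mean(known, axis=0) if known else np.zeros(wv.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

v1 = average_vector('hawaii makes a move to boost domestic travel'.split(), model.wv)
v2 = average_vector('honolulu makes a move to boost travel'.split(), model.wv)
print(cosine(v1, v2))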
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is not very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more-random) quality than your other word-vectors, having received comparatively little training attention, and each being fully influenced by just their few surrounding words (rather than the entire range of surrounding contexts that could all help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
Another possible way to compare travel_sent_A, travel_sent_B, etc would be to ignore the exact vector for travel or travel_sent_X entirely, but instead compile a summary vector for the word's surrounding N words. For example if you have 100 examples of the word travel, create 100 vectors that are each of the N words around travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very-different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context vector approach to influence/choose among alternate word-senses.)
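A sketch of that surrounding-words idea, again assuming the trained model and the tokenized sentences (vocab) from the question; the window size n is an assumption:

import numpy as np

def context_vectors(sentences, target, wv, n=5):
    """One averaged vector of the n words on each side of every occurrence of target."""
    vectors = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            window = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
            known = [wv[w] for w in window if w in wv.key_to_index]
            if known:
                vectors.append(np.mean(known, axis=0))
    return vectors

travel_contexts = context_vectors(vocab, 'travel', model.wv, n=5)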
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVector classes as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover’s Distance
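For example, assuming the word-vectors in model.wv from the question's code (wmdistance() needs an optimal-transport dependency such as POT installed, and works best on lowercased, stopword-stripped token lists):

sent_a = 'hawaii makes a move to boost domestic travel'.split()
sent_b = 'honolulu makes a move to boost travel'.split()

# Lower distance = more similar; the cost is computed over the word-vectors in model.wv
distance = model.wv.wmdistance(sent_a, sent_b)
print(distance)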
If you are interested in comparing sentences, Word2Vec is not the best choice. It has been shown that using it to create sentence embeddings produces worse results than a dedicated sentence-embedding algorithm. If your dataset is not huge, you can't create (train) a new embedding space using your own data. This forces you to use a pre-trained embedding for the sentences. Luckily, there are enough of those nowadays. I believe that the Universal Sentence Encoder (by Google) will suit your needs best.
Once you get vector representations for your sentences, you can go two ways (a rough sketch of both follows the list):
Create a matrix of pairwise comparisons and visualize it as a heatmap. This representation is useful when you have some prior knowledge about how close the sentences are and you want to check your hypothesis. You can even try it online.
Run t-SNE on the vector representations. This will create a 2D projection of the sentences that preserves the relative distances between them. It presents the data much better than PCA. Then you can easily find the neighbors of a certain sentence.
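A minimal sketch of that pipeline, assuming TensorFlow Hub with the Universal Sentence Encoder module (the module URL, the toy sentences from the question, and the perplexity value are assumptions to adjust for your data):

import numpy as np
import tensorflow_hub as hub
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

sentences = ["Hawaii makes a move to boost domestic travel",
             "Honolulu makes a move to boost travel",
             "Hawaii wants tourists to return"]

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = np.array(embed(sentences))                     # shape: (n_sentences, 512)

# 1) pairwise cosine-similarity heatmap
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
sims = (vectors @ vectors.T) / (norms * norms.T)
plt.imshow(sims); plt.colorbar(); plt.show()

# 2) 2D t-SNE projection (perplexity must stay below the number of sentences)
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1]); plt.show()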
You can learn more from this and this
Interesting take on the word2vec model. You can use t-SNE embeddings of the vectors to reduce the dimensionality to 3 and visualise them using any plotting library such as matplotlib or Dash. I also find this tool helpful when visualising word embeddings: https://projector.tensorflow.org/
The idea of learning different word embeddings for words in different contexts is the premise of ELMo (https://allennlp.org/elmo), but you will require a huge training set to train it. Luckily, if your application is not very specific, you can use pre-trained models.
I am interested in identifying the WordNet synset IDs for each word in a set of tags.
The words in the set provide the context for the word sense disambiguation, such as:
{mole, skin}
{mole, grass, fur}
{mole, chemistry}
{bank, river, river bank}
{bank, money, building}
I know of the Lesk algorithm and libraries such as pywsd, which are based on 10+ year old techniques (which may still be cutting edge -- that is my question).
Are there better-performing algorithms by now that make use of pre-trained embeddings, like GloVe, and maybe the distances between these embeddings?
Are there ready-to-use implementations of such WSD algorithms?
I know this question is close to the danger zone of asking for subjective preferences - as in this 5-year old thread. But I am not asking for an overview of options or the best software for a problem.
Transfer learning, particularly with models like Allen AI's ELMo, OpenAI's GPT, and Google's BERT, allowed researchers to smash multiple benchmarks with minimal task-specific fine-tuning, and provided the rest of the NLP community with pretrained models that can easily (with less data and less compute time) be fine-tuned and deployed to produce state-of-the-art results.
These representations will help you accurately retrieve results matching the customer's intent and contextual meaning, even if there's no keyword or phrase overlap.
To start off, embeddings are simply (moderately) low dimensional representations of a point in a higher dimensional vector space.
By translating a word to an embedding it becomes possible to model the semantic importance of a word in a numeric form and thus perform mathematical operations on it.
When this first became possible with the word2vec model, it was an amazing breakthrough. From there, many more advanced models surfaced which capture not only a static semantic meaning but also a contextualized meaning. For instance, consider the two sentences below:
I like apples.
I like Apple MacBooks.
Note that the word apple has a different semantic meaning in each sentence. Now, with a contextualized language model, the embedding of the word apple would have a different vector representation in each sentence, which makes it even more powerful for NLP tasks.
Contextual embeddings like BERT offer an advantage over models like Word2Vec: while each word has a fixed representation under Word2Vec regardless of the context within which it appears, BERT produces word representations that are dynamically informed by the words around them.
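As a small illustration of that difference, a rough sketch with the Hugging Face transformers library (the model name, the toy sentences from above, and the simple first-matching-token lookup are assumptions):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def token_vector(sentence, word):
    """Contextual vector of the first token matching `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = token_vector("I like apples.", "apples")
v2 = token_vector("I like Apple macbooks.", "apple")     # bert-base-uncased lowercases input
print(torch.cosine_similarity(v1, v2, dim=0))            # similarity of the two contextual vectors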
I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or an SVM, as described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way to do this with unlabeled data. One possibility could be using Expectation-Maximization to generate clusters and then labeling those clusters. But as said earlier, I have a predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow? I appreciate any help.
Alright, from what I can understand, I think there are multiple ways to approach this case.
There will be trade-offs and the accuracy may vary, because of the well-known fact and observation:
Each single tweet is distinct!
(unless you are extracting data from the Twitter streaming API based on tags and other keywords). Please define the source of your data and how you are extracting it. I am assuming you're just getting general tweets, which can be about anything.
What you can do is generate a dictionary for each class you have
(e.g. Music => pop, jazz, rap, instruments, ...)
which will contain words relevant to that class. You can use NLTK for Python or Stanford NLP for other languages.
You can start with extracting
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Go see these NLP lexical semantics slides; they will surely clarify some of the concepts.
Once you have dictionaries for each class, cross-compare them with the tweets you have. You can label each tweet with the class it is most similar to (for example, by ranking classes according to the number of occurrences of words from these dictionaries). This will give you labeled tweets.
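A rough sketch of that dictionary-building and matching idea with NLTK's WordNet interface (the seed words, the two example classes, and the simple count-based scoring are assumptions):

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet') once

def expand(seed_words):
    """Collect lemmas, hyponyms and hypernyms of every sense of the seed words."""
    words = set(seed_words)
    for word in seed_words:
        for syn in wn.synsets(word):
            words.update(l.name().lower() for l in syn.lemmas())
            for rel in syn.hyponyms() + syn.hypernyms():
                words.update(l.name().lower() for l in rel.lemmas())
    return words

dictionaries = {
    'music': expand(['music', 'pop', 'jazz', 'rap', 'instrument']),
    'sports': expand(['sport', 'football', 'tennis', 'match']),
}

def label(tweet):
    tokens = tweet.lower().split()
    scores = {cls: sum(t in words for t in tokens) for cls, words in dictionaries.items()}
    return max(scores, key=scores.get)   # class whose dictionary matches the most words

print(label("Listening to some jazz and pop on vinyl tonight"))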
Now the question is the accuracy! It depends on the data and the versatility of your classes. This may be overkill, but it may come close to what you want.
Furthermore, you can label some set of tweets this way and use cosine similarity to identify other tweets. This will help with the optimization part. But then again it's up to you, as you know what trade-offs you can bear.
The real struggle will be the machine learning part and how you manage that.
Actually, this seems like a typical use case for semi-supervised learning. There are plenty of methods of use here, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate a model from labeled samples onto the distribution of unlabeled ones).
You could also simply cluster the data as @Shoaib suggested, but then you will have to come up with a heuristic for how to deal with clusters of mixed labeling. Furthermore, solving an optimization problem not related to the task (labeling) will obviously not be as good as actually using this knowledge.
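If you want to try the semi-supervised route with a standard library, a minimal sketch using scikit-learn's LabelSpreading over TF-IDF features (the toy tweets and classes are assumptions; unlabeled samples are marked with -1):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

tweets = ["great match last night", "new jazz album out",
          "tennis finals today", "love this song"]
labels = [1, 0, -1, -1]                       # 0 = music, 1 = sports, -1 = unlabeled

X = TfidfVectorizer().fit_transform(tweets).toarray()
model = LabelSpreading(kernel='knn', n_neighbors=2)
model.fit(X, labels)
print(model.transduction_)                    # inferred labels for all samples, incl. unlabeled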
You can use clustering for that task. For that you have to label some examples for each class first. Then using these labeled examples, you can identify the class of each cluster easily.
For my project at work I am tasked with going through a bunch of user generated text, and in some of that text are reasons for cancelling their internet service, as well as how often that reason is occurring. It could be they are moving, just don't like it, or bad service, etc.
While this may not necessarily be a Python question, I am wondering if there is some way I can use NLTK or Textblob in some way to determine reasons for cancellation. I highly doubt there is anything automated for such a specialized task and I realize that I may have to build a neural net, but any suggestions on how to tackle this problem would be appreciated.
This is what I have thought about so far:
1) Use stemming and tokenization and tally up most frequent words. Easy method, not that accurate.
2) n-grams. Computationally intensive, but may hold some promise.
3) POS tagging and chunking, maybe find words which follow conjunctions such as "because".
4) Go through all text fields manually and keep a note of reasons for cancellation. Not efficient, defeats the whole purpose of finding some algorithm.
5) NN, have absolutely no idea, and I have no idea if it is feasible.
I would really appreciate any advice on this.
Don't worry if this answer is too general or you can't understand something - this is academic stuff and needs some basic preparation. Feel free to contact me with questions if you want (ask for my mail in a comment or something, we'll figure something out).
I think that this question is more suited for CrossValidated.
Anyway, first thing that you need to do is to create some training set. You need to find as many documents with reasons as you can and annotate them, marking phrases specifying reason. The more documents the better.
If you're gonna work with user reports - use example reports, so that training data and real data will come from the same source.
This is how you'll build some kind of corpus for your processing.
Then you have to specify what features you'll need. These may be POS tags, n-gram features, lemmas/stems, etc. This needs experimentation and some practice. Here I'd use some n-gram features (probably 2-grams or 3-grams) and maybe some knowledge based on WordNet.
The last step is building your chunker or annotator. It is a component that will take your training set, analyse it and learn what it should mark.
You'll meet something called the "semantic gap" - this term describes the situation when your program "learned" something other than what you wanted (it's a simplification). For example, you may use such a set of features that your chunker learns to find "I don't" phrases instead of reason phrases. It is really dependent on your training set and feature set.
If that happens, you should try changing your feature set, and after a while, try working on the training set, as it may not be representative.
How to build such a chunker? For your case I'd use an HMM (Hidden Markov Model) or - even better - a CRF (Conditional Random Field). These two are statistical methods commonly used for stream annotation, and your text is basically a stream of tokens. Another approach could be taking any "standard" classifier (from Naive Bayes, through decision trees and NNs, to SVMs) and using it on every n-gram in the text.
Of course choosing the feature set is highly dependent on the chosen method, so read about them and choose wisely.
PS. This is an oversimplified answer, missing many important things about training set preparation, choosing features, preprocessing your corpora, finding sources for them, etc. This is not a walk-through - these are basic steps that you should explore yourself.
PPS. Not sure, but NLTK may have some CRF or HMM implementation. If not, I can recommend scikit-learn for Markov models and CRF++ for CRFs. Look out - the latter is powerful, but is a pain to install and to use from Java or Python.
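Not the exact tools mentioned above, but as one concrete possibility: a minimal sketch of token-level BIO tagging with the sklearn-crfsuite package (the toy sentence, the hand-picked features and the REASON label scheme are assumptions):

import sklearn_crfsuite

def token_features(tokens, i):
    # very small feature set: the word itself plus its neighbours
    return {
        'word': tokens[i].lower(),
        'prev': tokens[i - 1].lower() if i > 0 else '<BOS>',
        'next': tokens[i + 1].lower() if i < len(tokens) - 1 else '<EOS>',
    }

# One annotated sentence: reason phrases marked with B-REASON / I-REASON, everything else O
sent = "I am cancelling because I am moving abroad".split()
tags = ['O', 'O', 'O', 'O', 'B-REASON', 'I-REASON', 'I-REASON', 'I-REASON']

X_train = [[token_features(sent, i) for i in range(len(sent))]]
y_train = [tags]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))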
==EDIT==
Shortly about features:
First, what kinds of features can we imagine?
lemma/stem - you find stems or lemmas for each word in your corpus, choose the most important ones (usually those with the highest frequency, or at least that's where you'll start) and then represent each word/n-gram as a binary vector, stating whether the represented word or sequence (after stemming/lemmatization) contains that feature lemma/stem
n-grams - similar to the above, but instead of single words you choose the most important sequences of length n. "n-gram" means "sequence of length n", so e.g. the bigrams (2-grams) for "I sat on a bench" will be: "I sat", "sat on", "on a", "a bench".
skipgrams - similar to n-grams, but containing "gaps" in the original sentence. For example, 2-skipgrams with gap size 3 for "Quick brown fox jumped over something" (sorry, I can't remember this phrase right now :P) will be: ["Quick", "over"], ["brown", "something"]. In general, n-skipgrams with gap size m are obtained by taking a word, skipping m, taking a word, etc., until you have n words (see the sketch after this list).
POS tags - I've always confused them with "positional" tags, but this is an acronym for "Part Of Speech". They are useful when you need to find phrases that have a common grammatical structure, not common words.
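A tiny pure-Python sketch that makes the n-gram and skipgram definitions above concrete:

def ngrams(tokens, n):
    """All contiguous sequences of length n."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, gap):
    """n words taken every (gap + 1) positions, as described above."""
    step = gap + 1
    return [tokens[i:i + (n - 1) * step + 1:step]
            for i in range(len(tokens) - (n - 1) * step)]

print(ngrams("I sat on a bench".split(), 2))
# [['I', 'sat'], ['sat', 'on'], ['on', 'a'], ['a', 'bench']]
print(skipgrams("Quick brown fox jumped over something".split(), 2, 3))
# [['Quick', 'over'], ['brown', 'something']]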
Of course you can combine them - for example use skipgrams of lemmas, or POS tags of lemmas, or even *-grams (choose your favourite :P) of POS-tags of lemmas.
What would be the point of using the POS tag of a lemma? It would describe the part of speech of the basic form of the word, so it would simplify your feature to facts like "this is a noun" instead of "this is a plural feminine noun".
Remember that choosing features is one of the most important parts of the whole process (the other is data preparation, but that deserves a whole semester of courses, while feature selection can be handled in 3-4 lectures, so I'm trying to cover the basics here).
You need some kind of intuition while "hunting" for chunks - for example, if I wanted to find all expressions about colors, I'd probably try using 2- or 3-grams of words, represented as binary vectors describing whether such an n-gram contains the most popular color names and modifiers (like "light", "dark", etc.), plus POS tags. Even if you missed some colors (say, "magenta"), you could still find them in the text if your method (I'd go with CRF again; it's a wonderful tool for this kind of task) generalized the learned knowledge enough.
While FilipMalczak's answer states the state-of-the-art method to solve your problem, a simpler solution (or maybe a preliminary first step) would be to do simple document clustering. This, done right, should cluster together responses that contain similar reasons. Also for this, you don't need any training data. The following article would be a good place to start: http://brandonrose.org/clustering
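For example, a minimal sketch of such clustering with scikit-learn (the example responses and the number of clusters are assumptions; the linked article goes much deeper):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

responses = [
    "I'm moving to another city next month",
    "Relocating abroad, no longer need the service",
    "The connection keeps dropping, terrible service",
    "Too many outages and slow speeds lately",
]

X = TfidfVectorizer(stop_words='english').fit_transform(responses)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(labels)   # responses with similar reasons should land in the same cluster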