Unsupervised Clustering of Words in a document semantically

I want to cluster words based on their semantic similarity. I currently have a list of documents with the noun phrases detected in them. I want to cluster these noun phrases semantically, without supervision.
I have looked at the WordNet and gensim libraries. Any suggestions as to which of them would help in clustering words by semantic similarity?

For similarity based on phrase co-occurrence (phrases appearing more often together in documents will be more similar), you can use gensim.
Check out the Latent Semantic Analysis and Latent Dirichlet Allocation there: http://radimrehurek.com/gensim/tut2.html#available-transformations
Depending on what exactly you want your clusters to do, you can either use the LSI/LDA topics directly as clusters, or cluster the obtained latent phrase vectors.
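To make the co-occurrence idea concrete, here is a minimal stdlib-only sketch. It is not gensim's LSI/LDA: it only builds raw phrase-by-document count vectors and compares them with cosine similarity, which is the kind of input LSI would then compress into latent vectors. The documents and the helper names `phrase_vectors` and `cosine` are made up for illustration.

```python
from math import sqrt

# Hypothetical input: each document is a list of its detected noun phrases.
docs = [
    ["neural network", "gradient descent", "loss function"],
    ["neural network", "loss function"],
    ["stock market", "interest rate"],
    ["interest rate", "stock market", "central bank"],
]

def phrase_vectors(docs):
    """Map each phrase to a vector of its counts per document."""
    vectors = {}
    for doc_index, phrases in enumerate(docs):
        for phrase in phrases:
            vectors.setdefault(phrase, [0] * len(docs))[doc_index] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = phrase_vectors(docs)
# Phrases that appear in the same documents come out as similar.
sim_same = cosine(vectors["neural network"], vectors["loss function"])
# Phrases that never share a document come out as dissimilar.
sim_diff = cosine(vectors["neural network"], vectors["stock market"])
```

Any standard clustering algorithm could then be run on these vectors, or on the lower-dimensional vectors that LSI produces from them.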

Related

Using word embeddings to train for document similarity or topic classification

This is just a general question, since I started a course but it is closed over Christmas/New Year.
I know that word embeddings are mathematical representations of words. My question: suppose I had a dataset with a large corpus representing the unique words in different documents. Once the word embeddings of this corpus have been learned, can they be used for classification tasks, such as estimating the probability that a given document belongs to a given topic?
Further, for my own curiosity and because I want to get into this industry, what would be a good NLP task to show on my CV? I've got the hang of Selenium, PyTorch and OpenCV.

Inconsistencies between bigrams found by TfidfVectorizer and Word2Vec model

I am building a topic model from scratch, one step of which uses the TfidfVectorizer method to get unigrams and bigrams from my corpus of texts:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.9, ngram_range=(1, 2))
After topics are created, I use the similarity scores provided by gensim's Word2Vec to determine coherence of topics. I do this by training on the same corpus:
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases
bigram_transformer = Phrases(corpus)
model = Word2Vec(bigram_transformer[corpus], min_count=1)
For many of the bigrams in my topics, however, I get a KeyError because that bigram was not picked up in the training of Word2Vec, even though both were trained on the same corpus. I think this is because Word2Vec decides which bigrams to choose based on statistical analysis (see "Why aren't all bigrams created in gensim's `Phrases` tool?").
Is there a way to get the Word2Vec to include all those bigrams identified by TfidfVectorizer? I see trimming capabilities such as 'trim_rule' but not anything in the other direction.

The point of the Phrases model in Gensim is to pick some bigrams that are calculated to be statistically significant.
If you then apply that model's determinations as a preprocessing step on your corpus, certain pairs of unigrams will be outright replaced in your text with the combined bigram. (As such, it's possible some unigrams that were there originally will no longer appear even once.)
Thus the concepts of bigrams as used by Gensim's Phrases and the TfidfVectorizer's ngram_range facility are different. Phrases is meant for destructive replacements where specific bigrams are inferred to be more interesting than the unigrams. TfidfVectorizer will add extra bigrams as additional dimensional features.
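The difference between the two notions of "bigram" can be sketched in plain Python. This is only an illustration of the two behaviours, not the actual code of either library: `ngram_features` and `replace_selected_bigrams` are hypothetical helpers, and the set of "selected" pairs stands in for whatever Phrases would have judged significant.

```python
def ngram_features(tokens):
    """Additive: keep every unigram and add every overlapping bigram as an extra feature."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

def replace_selected_bigrams(tokens, selected, delimiter="_"):
    """Destructive: merge only the selected pairs, so their unigrams vanish from the text."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in selected:
            out.append(delimiter.join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["new", "york", "is", "big"]
features = ngram_features(tokens)
replaced = replace_selected_bigrams(tokens, {("new", "york")})
```

With this toy input, `features` contains "york is" and "is big", while `replaced` contains neither, which is the same mismatch that produces the KeyError in the question.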
I suppose the right tuning of Phrases could cause it to consider every bigram as significant. Without checking, it looks like a super-tiny threshold value, like 0.0000000001, might have essentially that effect. (The Phrases class will reject a value of 0 as nonsensical given its usual use.)
But at that point, your later transformation (via bigram_transformer[corpus]) will combine every possible pair of words before Word2Vec training. For example, the sentence:
['the', 'skittish', 'cat', 'jumped', 'over', 'the', 'gap']
...would indiscriminately become...
['the_skittish', 'cat_jumped', 'over_the', 'gap']
It seems unlikely that you want that, for a number of reasons:
There might then be no training texts with the 'cat' unigram alone, leaving you with no word-vector for that word at all.
Bigrams that are rare or of little grammatical value (like 'the_skittish') will receive trained word-vectors, & take up space in the model.
The kinds of text corpus that are large enough for good Word2Vec results might have far more bigrams than are manageable. (A corpus small enough that you can afford to track every bigram may be on the thin side for good Word2Vec results.)
Further, to perform that greedy-combination of all bigrams, the Phrases frequency-survey & calculations aren't even necessary. (It can be done automatically with no preparation/analysis.)
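As a rough illustration of that point, the greedy combination needs nothing more than a left-to-right pass over each sentence. The function name `combine_all_bigrams` is made up here, and the `_` delimiter is just the one used in the example above.

```python
def combine_all_bigrams(tokens, delimiter="_"):
    """Greedily join tokens pairwise, left to right; a trailing odd token is kept as-is."""
    combined = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            combined.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            combined.append(tokens[i])
            i += 1
    return combined

sentence = ["the", "skittish", "cat", "jumped", "over", "the", "gap"]
paired = combine_all_bigrams(sentence)
```

Note that a pair such as ('skittish', 'cat') never appears in the output, so even this approach would not give a vector to every bigram that TfidfVectorizer reports.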
So, you shouldn't expect every bigram of TfidfVectorizer to get a word-vector, unless you take some extra steps, outside the normal behavior of Phrases, to ensure every such bigram was in the training texts.
To try to do so wouldn't necessarily need Phrases at all, and might be unmanageable, and involve other tradeoffs. (For example, I could imagine repeating the corpus many times, only combining a fraction of the bigrams each time – so that each is sometimes surrounded by other unigrams, and sometimes by other bigrams – to create a synthetic corpus with enough meaningful texts to create all your desired vectors. But the logic & storage space for that model would be larger & more complicated, and without prominent precedent, so it'd be a novel experiment.)

Can I get topics distribution of a word in LDA?

I'm new to LDA and I want to calculate the topic similarity between words. Can I get the topic distribution of a word? If so, how can I do this in gensim.ldamodel?

Gensim's LDA mallet wrapper has a load_word_topics() function (I would assume this is true for its python LDA implementation as well). It returns a matrix that is words X topics. From that, you can get a vector of the frequencies of each word in each topic. Normalized, that vector is the topic distribution for a given word.
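A minimal sketch of that last step, assuming you already have per-topic counts for each word. The `word_topic_counts` table below is invented toy data rather than real model output, and `topic_distribution` and `cosine` are hypothetical helper names.

```python
from math import sqrt

# Hypothetical counts: how often each word was assigned to each of 3 topics.
word_topic_counts = {
    "bank": [30, 10, 0],
    "river": [2, 18, 0],
    "loan": [25, 0, 5],
}

def topic_distribution(word, counts):
    """Normalize a word's per-topic counts so they sum to 1."""
    row = counts[word]
    total = sum(row)
    return [c / total for c in row] if total else [0.0] * len(row)

def cosine(u, v):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

bank = topic_distribution("bank", word_topic_counts)
loan = topic_distribution("loan", word_topic_counts)
river = topic_distribution("river", word_topic_counts)
# One possible "topic similarity" between words: compare their distributions.
bank_loan = cosine(bank, loan)
bank_river = cosine(bank, river)
```

Cosine is only one choice here; a divergence between distributions (for example Jensen-Shannon) would be an equally reasonable way to compare two words.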

Use tf-idf with FastText vectors

I'm interested in using tf-idf with the FastText library, but have not found a logical way to handle the n-grams. I have already used tf-idf with SpaCy vectors, for which I found several examples like these:
http://dsgeek.com/2018/02/19/tfidf_vectors.html
https://www.aclweb.org/anthology/P16-1089
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
But for the FastText library it is not that clear to me, since it has a granularity that isn't that intuitive, e.g.:
For a general word2vec approach I will have one vector for each word; I can count the term frequency of that vector and divide its value accordingly.
But for fastText the same word will have several n-grams.
"Listen to the latest news summary" will have n-grams generated by a sliding window like:
lis ist ste ten tot het...
These n-grams are handled internally by the model so when I try:
model["Listen to the latest news summary"]
I get the final vector directly, hence what I have thought is to split the text into n-grams before feeding the model, like:
model['lis']
model['ist']
model['ten']
And make the tf-idf from there, but that seems like an inefficient approach. Is there a standard way to apply tf-idf to vector n-grams like these?

I would let FastText deal with the trigrams, but keep building the tfidf-weighted embeddings at the word level.
That is, you send
model["Listen"]
model["to"]
model["the"]
...
to FastText, and then use your old code to get the tf-idf weights.
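For illustration, here is a stdlib-only sketch of a word-level tf-idf weighted average. The `word_vectors` dict is a toy stand-in for the per-word lookups (model["Listen"] and so on); the documents, the `log(N/df) + 1` idf variant, and the helper names are assumptions made for the example, not what any particular library does.

```python
from collections import Counter
from math import log

# Toy stand-in for the vectors FastText would return per word.
word_vectors = {
    "listen": [1.0, 0.0],
    "to": [0.0, 1.0],
    "the": [0.0, 1.0],
    "news": [1.0, 1.0],
}

docs = [["listen", "to", "the", "news"], ["the", "news"], ["to", "the"]]

def idf(docs):
    """Inverse document frequency per word (assumed variant: log(N / df) + 1)."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {word: log(n / df[word]) + 1.0 for word in df}

def tfidf_weighted_vector(doc, idf_weights, vectors):
    """Average the word vectors of a document, weighting each word by tf * idf."""
    tf = Counter(doc)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in tf.items():
        if word not in vectors:
            continue  # skip words without a vector
        weight = count * idf_weights[word]
        weight_sum += weight
        for i, x in enumerate(vectors[word]):
            total[i] += weight * x
    return [x / weight_sum for x in total] if weight_sum else total

idf_weights = idf(docs)
doc_vector = tfidf_weighted_vector(docs[0], idf_weights, word_vectors)
```

Here the rare word "listen" pulls the document vector more strongly than "the", which appears in every document.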
In any case, it would be good to know whether FastText itself considers the word construct when processing a sentence, or whether it truly only treats it as a sequence of trigrams (blending consecutive words). If the latter is true, then for FastText you would lose information by breaking the sentence into separate words.

You are talking about the fasttext tokenization step (not fasttext embeddings), which is a (3,6) char-n-gram tokenization, compatible with tfidf. The full step can be computed outside of fasttext quite easily; see "Calculate TF-IDF using sklearn for n-grams in python".
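A small sketch of such a tokenization done outside the library, assuming the word is wrapped in `<` and `>` boundary markers before n-grams of length 3 to 6 are taken. `char_ngrams` is a hypothetical helper; the real library additionally hashes n-grams into buckets, which this sketch ignores.

```python
from collections import Counter

def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word, with '<' and '>' marking the word boundaries."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

trigrams = char_ngrams("where", 3, 3)

# Term frequencies over n-grams, which is the "tf" input a tf-idf scheme needs.
sentence = "listen to the latest news summary"
ngram_counts = Counter(g for word in sentence.split() for g in char_ngrams(word))
```

Because the n-grams are taken per word, nothing spans a word boundary: "ten" comes from "listen", but there is no "n t" or "nto" bridging "listen" and "to".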

From what I understood of your question, you are confusing word embedding methods (such as word2vec and many others) with Tf-Idf:
Basically, word embedding methods are unsupervised models for generating word vectors. The word vectors generated by this kind of model are now very popular in NLP tasks. This is because a word embedding representation of a word captures more information about a word than just a one-hot representation of the word, since the former captures semantic similarity of that word to other words whereas the latter representation of the word is equidistant from all other words. FastText is another way to implement word embeddings (recently open-sourced by Facebook researchers).
Tf-idf, instead, is a scoring scheme for words, that is, a measure of how important a word is to a document.
From a practical usage standpoint, while tf-idf is a simple scoring scheme and that is its key advantage, word embeddings may be a better choice for most tasks where tf-idf is used, particularly when the task can benefit from the semantic similarity captured by word embeddings (e.g. in information retrieval tasks).
Unlike Word2Vec, which learns a vector representation of the entire word, FastText learns a representation for each n-gram of the word, as you have already seen. So the overall word embedding is the sum of the n-gram representations. Because the FastText model works with more n-grams than there are words, it often performs better than Word2Vec and allows rare words to be represented appropriately.
From my standpoint, in general it does not make sense to use FastText (or any word embedding method) together with Tf-Idf. But if you want to use Tf-Idf with FastText, you must sum all the n-grams that compose your word and use this representation to calculate the Tf-Idf.

Python Calculating similarity between two documents using word2vec, doc2vec

I am trying to calculate the similarity between two documents, each of which comprises thousands of sentences.
The baseline would be calculating cosine similarity using BOW.
However, I want to capture more of the semantic difference between documents.
Hence, I built word embeddings and calculated document similarity by generating document vectors, simply averaging all the word vectors in each document, and measuring the cosine similarity between these document vectors.
However, since each input document is rather big, the results I get from the method above are very similar to simple BOW cosine similarity.
I have two questions,
Q1. I found that the gensim module offers soft cosine similarity. But I am having a hard time understanding how it differs from the methods I used above, and I think it may not be the right mechanism for calculating similarity between millions of pairs of documents.
Q2. I found that Doc2Vec by gensim would be more appropriate for my purpose. But I realized that training Doc2Vec requires more RAM than I have (32GB) (the size of my entire document set is about 100GB). Is there any way to train the model on a small part (say 20GB) of the entire corpus, and use this model to calculate pairwise similarities for the entire corpus?
If yes, then what would be a desirable training set size, and is there any tutorial that I can follow?

Ad Q1: If the similarity matrix contains the cosine similarities of the word embeddings (which it more or less does, see Equation 4 in SimBow at SemEval-2017 Task 3) and if the word embeddings are L2-normalized, then the SCM (Soft Cosine Measure) is equivalent to averaging the word embeddings (i.e. your baseline). For a proof, see Lemma 3.3 in the Implementation Notes for the SCM. My Gensim implementation of the SCM (1, 2) additionally sparsifies the similarity matrix to keep the memory footprint small and to regularize the embeddings, so you will get slightly different results compared to vanilla SCM. If embedding averaging gives you similar results to simple BOW cosine similarity, I would question the quality of the embeddings.
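The equivalence can be checked numerically on toy data. The sketch below assumes the plain form of the SCM, x^T S y / (sqrt(x^T S x) * sqrt(y^T S y)), with S holding the dot products of L2-normalized embeddings and no sparsification. The three embeddings and the two bag-of-words vectors are invented for the example.

```python
from math import sqrt

# Toy L2-normalized embeddings (each vector has length 1).
embeddings = {"cat": [1.0, 0.0], "dog": [0.6, 0.8], "car": [0.0, 1.0]}
vocab = sorted(embeddings)  # ['car', 'cat', 'dog']

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Word-word similarity matrix S: cosine similarity of the embeddings.
similarity = [[dot(embeddings[a], embeddings[b]) for b in vocab] for a in vocab]

def soft_cosine(x, y, s):
    """Soft Cosine Measure between two bag-of-words vectors x and y."""
    def form(p, q):
        return sum(p[i] * s[i][j] * q[j] for i in range(len(p)) for j in range(len(q)))
    return form(x, y) / sqrt(form(x, x) * form(y, y))

def pooled_embedding_cosine(x, y):
    """Cosine between count-weighted sums of embeddings (same direction as the average)."""
    def pooled(bow):
        dim = len(embeddings[vocab[0]])
        return [sum(bow[i] * embeddings[w][k] for i, w in enumerate(vocab)) for k in range(dim)]
    px, py = pooled(x), pooled(y)
    return dot(px, py) / sqrt(dot(px, px) * dot(py, py))

doc_a = [2, 0, 1]  # counts for car, cat, dog
doc_b = [0, 1, 1]
scm = soft_cosine(doc_a, doc_b, similarity)
avg = pooled_embedding_cosine(doc_a, doc_b)
```

The two numbers agree up to floating-point error, since x^T S y expands to the dot product of the two pooled embedding vectors. Once the similarity matrix is sparsified or otherwise modified, that identity no longer holds exactly.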
Ad Q2: Training a Doc2Vec model on the entire dataset for one epoch is equivalent to training a Doc2Vec model on smaller segments of the entire dataset, one epoch for each segment. Just be aware that Doc2Vec uses document ids as a part of the training process, so you must ensure that the ids are still unique after the segmentation (i.e. the first document of the first segment must have a different id than the first document of the second segment).
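One way to keep the ids unique is to carry a running counter across segments rather than restarting at 0 for each one. The sketch below is plain Python with a hypothetical `tag_segments` helper; with gensim, each (id, tokens) pair would presumably be turned into a `TaggedDocument(words=tokens, tags=[doc_id])` before training.

```python
def tag_segments(segments):
    """Yield each segment as a list of (doc_id, tokens), with ids unique across all segments."""
    next_id = 0
    for segment in segments:
        tagged = []
        for tokens in segment:
            tagged.append((next_id, tokens))
            next_id += 1
        yield tagged

# Hypothetical corpus split into two segments of tokenized documents.
segments = [
    [["first", "doc"], ["second", "doc"]],
    [["third", "doc"], ["fourth", "doc"]],
]
tagged_segments = list(tag_segments(segments))
all_ids = [doc_id for segment in tagged_segments for doc_id, _ in segment]
```

In practice each segment would be read lazily from disk rather than held in a list, but the numbering logic stays the same.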
