I wonder what does .build_vocab_from_freq() function from gensim actually do? What is the difference when I'm not using it? Thank you!
It "builds a vocabulary from a dictionary of word frequencies". You need a vocabulary for your gensim models. Usually you build it from your corpus. This is basically an alternative option to build your vocabulary from a word frequencies dictionary. Word frequencies for example are usually used to filter low or high frequent words which are meaningless for your model.
Related
I am building a topic model from scratch, one step of which uses the TfidfVectorizer method to get unigrams and bigrams from my corpus of texts:
tfidf_vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.9, ngram_range = (1,2))
After topics are created, I use the similarity scores provided by gensim's Word2Vec to determine coherence of topics. I do this by training on the same corpus:
bigram_transformer = Phrases(corpus)
model = Word2Vec(bigram_transformer[corpus], min_count=1)
For many of the bigrams in my topics however, I get a KeyError because that bigram was not picked up in the training of Word2Vec, despite them being trained on the same corpus. I think this is because Word2Vec decides on which bigrams to choose based on statistical analysis (Why aren't all bigrams created in gensim's `Phrases` tool?)
Is there a way to get the Word2Vec to include all those bigrams identified by TfidfVectorizer? I see trimming capabilities such as 'trim_rule' but not anything in the other direction.
The point of the Phrases model in Gensim is to pick some bigrams, which are calculated to be statistically-significant.
If you then apply that model's determinations as a preprocessing step on your corpus, certain pairs of unigrams will be outright replaced in your text with the combined bigram. (As such, it's possible some unigrams that were there originally will no longer appear even once.)
Thus the concepts of bigrams as used by Gensim's Phrases and the TfidfVectorizer's ngram_range facility are different. Phrases is meant for destructive replacements where specific bigrams are inferred to be more interesting than the unigrams. TfidfVectorizer will add extra bigrams as additional dimensional features.
I suppose the right tuning of Phrases could cause it to consider every bigram as significant. Without checking, it looks like a super-tiny value, like 0.0000000001, might have essentially that effect. (The Phrases class will reject a value of 0 as nonsensical given its usual use.)
But at that point, your later transformation (via bigram_transformer[corpus]) will combine every possible pair of words before Word2Vec training. For example, the sentence:
['the', 'skittish', 'cat', 'jumped', 'over', 'the', 'gap',]
...would indiscriminately become...
['the_skittish', 'cat_jumped', 'over_the', 'gap',]
It seems unlikely that you want that, for a number of reasons:
There might then be no training texts with the 'cat' unigram alone, leaving you with no word-vector for that word at all.
Bigrams that are rare or of little grammatical value (like 'the_skittish') will receive trained word-vectors, & take up space in the model.
The kinds of text corpus that are large enough for good Word2Vec results might have far more bigrams than are manageable. (A corpus small enought that you can afford to track every bigram may be on the thin side for good Word2Vec results.)
Further, to perform that greedy-combination of all bigrams, the Phrases frequency-survey & calculations aren't even necessary. (It can be done automatically with no preparation/analysis.)
So, you shouldn't expect every bigram of TfidfVectorizer to be get a word-vector, unless you take some extra steps, outside the normal behavior of Phrases, to ensure every such bigram was in the training texts.
To try to do so wouldn't necessarily need Phrases at all, and might be unmanageable, and involve other tradeoffs. (For example, I could imagine repeating the corpus many times, only combining a fraction of the bigrams each time – so that each is sometimes surrounded by other unigrams, and sometimes by other bigrams – to create a synthetic corpus with enough meaningful texts to create all your desired vectors. But the logic & storage space for that model would be larger & complicated, and without prominent precedent, so it'd be a novel experiment.)
I have a set of text documents(2000+) with labels (Liked/Disliked).Each document consists of 200+ words.
I am trying to do a supervised learning with these documents.
My approach would be:
Vectorize each document in the corpus. Say we have 2347 docs.
I can have 2347 rows with labels viz. Like as 1 and Dislike as 0.
Using any ML classification supervised model train above dataset with 2347 rows.
How to vectorize and create such dataset?
One of the things you can try is using Doc2Vec. This will allow you to map each document to a vector of dimension N. Then you can use any supervised learning algorithm to train on these N features.
There are other alternatives to doc2vec mentioned here. Try the Average of Word2Vec vectors with TF-IDF approach as well.
Also, make sure you apply appropriate text cleaning before applying doc2vec or word2vec. Steps like case normalization, stopword removal, punctuation removal, etc. It really depends on your dataset. Find out more here
I would also suggest engineering some features from your data if you are looking to predict like/dislike. This depends on your data and problem, but some examples are
The proportion of uppercase words
Slang words present or not
Emoticons present or not
Language of the text
The sentiment of the text - this is a whole new topic altogether though
I hope this was helpful...
I'm interested in using tf-idf with FastText library, but have found a logical way to handle the ngrams. I have used tf-idf with SpaCy vectors already for what I have found several examples like these ones:
http://dsgeek.com/2018/02/19/tfidf_vectors.html
https://www.aclweb.org/anthology/P16-1089
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
But for FastText library is not that clear to me, since it has a granularity that isn't that intuitive, E.G.
For a general word2vec aproach I will have one vector for each word, I can count the term frequency of that vector and divide its value accordingly.
But for fastText same word will have several n-grams,
"Listen to the latest news summary" will have n-grams generated by a sliding windows like:
lis ist ste ten tot het...
These n-grams are handled internally by the model so when I try:
model["Listen to the latest news summary"]
I get the final vector directly, hence what I have though is to split the text into n-grams before feeding the model like:
model['lis']
model['ist']
model['ten']
And make the tf-idf from there, but that seems like an inefficient approach both, is there a standar way to apply tf-idf to vector n-grams like these.
I would leave FastText deal with trigrams, but keep building the tfidf-weighted embeddings at the word level.
That is, you send
model["Listen"]
model["to"]
model["the"]
...
to FastText, and then use your old code to get the tf-idf weights.
In any case, it would be good to know whether FastText itself considers the word construct when processing a sentence, or it truly only works it as a sequence of trigrams (blending consecutive words). If the latter is true, then for FastText you would lose information by breaking the sentence into separate words.
You are talking about fasttext tokenization step (not fasttext embeddings) which is a (3,6) char-n-gram tokenization, compatible with tfidf. The full step can be computed outside of fasttext quite easily Calculate TF-IDF using sklearn for n-grams in python
For what I understood from your question you are confusing the difference between word embeddings methods (such as word2vec and many other) and Tf-Idf:
Basically Word Embeddings methods are unsupervised models for
generating word vectors. The word vectors generated by this kind of
models are now very popular in NPL tasks. This is because a word
embedding representation of a word captures more information about
a word than just a one-hot representation of the word, since the
former captures semantic similarity of that word to other words
whereas the latter representation of the word is equidistant from all
other words. FastText is another way to implements word embedding (recently opensourced by facebook researcher).
Tf-idf, instead is a scoring scheme for words, that is a measure of how
important a word is to a document.
From a practical usage standpoint, while tf-idf is a simple scoring scheme and that is its key advantage, word embeddings may be a better choice for most tasks where tf-idf is used, particularly when the task can benefit from the semantic similarity captured by word embeddings (e.g. in information retrieval tasks).
Unlike Word2Vec that learn a vector representation of the entire word, FastText learn a representation for each n-gram of the word as you already seen. So the overall word embeddings is the sum of the n-gram representation. Basically FastText model (number of n-grams > number of words), it performs better than Word2Vec and allows rare words to be represented appropriately.
For my standpoint in general It does not make sense use FastText (or any word embeddings methods) together with Tf-Idf. But if you want use Tf-Idf with FastText you must sum all the n-gram that compose your word and use this representation to calculate the Tf-Idf.
How to get most frequent context words from pretrained fasttext model?
For example:
For word 'football' and corpus ["I like playing football with my friends"]
Get list of context words: ['playing', 'with','my','like']
I try to use
model_wiki = gensim.models.KeyedVectors.load_word2vec_format("wiki.ru.vec")
model.most_similar("блок")
But it's not satisfied for me
The plain model doesn't retain any such co-occurrence statistics from the original corpus. It just has the trained results: vectors per word.
So, the ranked list of most_similar() vectors – which isn't exactly words that appeared-together, but strongly correlates to that – is the best you'll get from that file.
Only going back to the original training corpus would give you exactly what you've requested.
I am new to python and NLTk. I have a model created for sentiment analysis of survey in NLTK (NaivesBayesCalssifier). To improve the accuracy, i wanted to add some dictionary containing list of positive and negative statements in the model. Is there any module in NLTK and are there any additional features that can improve my model?
You can have a look at some public sentiment lexicons which would provide you a corpus of positive and negative words.
One of them can be found at https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Since, you haven't specified any details about your current model, I'm assuming you are using a very basic Naive Bayes classifier. If you are using unigrams(words) to vectorize your text right now, then you can consider using bigrams or trigrams for generating the feature vectors.This would basically, enable you to use the contextual information of the words to a certain extent.
If you are currently using a bag of words model like Tfidf to convert your text to converts then you can consider using word embeddings instead of that. Bag of words doesn't consider the contextual information of the words whereas, word embeddings are able to capitalise on that.
You could use somethings like gensim which uses deep learning to convert words to vectors. Have a look at : https://radimrehurek.com/gensim/models/word2vec.html
Furthermore, you can always try using a linearSVC classifier or a logistic regression classifier and choose whichever one gives the best accuracy.
you can download one from NLTK,just like:
from nltk.corpus import opinion_lexicon
pos_list=set(opinion_lexicon.positive())
neg_list=set(opinion_lexicon.negative())