TfidfVectorizer - Normalisation bias - python

I want to make sure I understand what the attributes use_idf and sublinear_tf do in the TfidfVectorizer object. I've been researching this for a few days. I am trying to classify documents of varied length and currently use tf-idf for feature selection.
I believe that when use_idf=True the algorithm normalises against the inherent issue with raw TF, where a term that is X times more frequent shouldn't be X times as important, utilising the tf*idf formula. Then sublinear_tf=True applies 1 + log(tf), which I understand to normalise the bias against lengthy documents versus short documents.
I am dealing with an inherent bias towards lengthy documents (most belong to one class); does this normalisation really diminish that bias?
How can I make sure the length of the documents in the corpus is not integrated into the model?
I'm trying to verify that the normalisation is being applied in the model. I am trying to extract the normalised vectors of the corpus, so I assumed I could just sum up each row of the TfidfVectorizer matrix. However, the sums are greater than 1, and I thought a normalised corpus would transform all documents to a range between 0 and 1.
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(max_features=20000, strip_accents='unicode',
                       stop_words=stopwords, analyzer='word', use_idf=True,
                       tokenizer=tokenizer, ngram_range=(1, 2),
                       sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(X_train)
# sum each l2-normalised document (row) of the tf-idf matrix
vect_sum = tfidf.sum(axis=1)

Neither use_idf nor sublinear_tf deals with document length. And actually your explanation for use_idf, "where a term that is X times more frequent shouldn't be X times as important", is a better description of sublinear_tf, since sublinear_tf makes the tf-idf score grow logarithmically with the term frequency rather than linearly.
use_idf means using the Inverse Document Frequency, so that terms which appear so frequently that they occur in most documents (i.e., a bad indicator) are weighted less than terms which appear less frequently but only in specific documents (i.e., a good indicator).
To reduce document-length bias, you use normalisation (the norm parameter of TfidfVectorizer): each term's tf-idf score is scaled relative to the total score of that document (dividing by the sum of absolute values for norm='l1', or by the Euclidean length, the square root of the sum of squares, for norm='l2').
By default, TfidfVectorizer already uses norm='l2', though, so I'm not sure what is causing the problem you are facing. Perhaps those longer documents do indeed contain similar words? Also, classification often depends a lot on the data, so I can't say much more here to solve your problem.
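To see that the l2 normalisation is actually applied, sum the squares of each row rather than the raw values: with norm='l2' every row has unit Euclidean length, even though the plain sum of its (non-negative) entries can exceed 1. A minimal sketch with a made-up toy corpus:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox", "the lazy dog", "a very long document about dogs and foxes " * 20]
vect = TfidfVectorizer(use_idf=True, sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(docs)

row_sums = np.asarray(tfidf.sum(axis=1)).ravel()                      # generally > 1
row_sq_norms = np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel()  # ~1.0 for every row
print(row_sums, row_sq_norms)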
References:
TfidfVectorizer documentation
Wikipedia
Stanford Book

use_idf=True (the default) introduces a global component on top of the term-frequency component (the local component: the individual article). When looking at the similarity of two texts, instead of just counting the terms each of them contains and comparing the counts, introducing the idf helps categorise those terms as relevant or not. According to Zipf's law, the "frequency of any word is inversely proportional to its rank". That is, the most common word will appear roughly twice as many times as the second most common word, three times as many as the third most common word, and so on. Even after removing stop words, all words are subject to Zipf's law.
In this sense, imagine you have 5 articles describing the topic of automobiles. In this example the word "auto" will likely appear in all 5 texts, and therefore will not be a unique identifier of a single text. On the other hand, if only one article describes auto "insurance" while another describes auto "mechanics", these two words ("mechanics" and "insurance") will each be a unique identifier of its text. By using the idf, words that appear in fewer texts ("mechanics" and "insurance", for example) receive a higher weight. Therefore using the idf does not tackle the bias generated by the length of an article, since it is, again, a global measure. If you want to reduce the bias generated by length then, as you said, using sublinear_tf=True is one way to address it, since you are transforming the local component (each article).
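A toy illustration of that example (the article texts are made up; get_feature_names_out() requires a recent scikit-learn, older versions use get_feature_names()):

from sklearn.feature_extraction.text import TfidfVectorizer

articles = ["auto insurance rates", "auto mechanics tools", "auto racing news",
            "auto dealership offers", "auto repair guide"]
vect = TfidfVectorizer(use_idf=True)
vect.fit(articles)
idf = dict(zip(vect.get_feature_names_out(), vect.idf_))
# "auto" occurs in all 5 articles, so its idf is the minimum;
# "insurance" and "mechanics" occur in one article each, so they get the highest idf.
print(idf["auto"], idf["insurance"], idf["mechanics"])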
Hope it helps.

Related

How do I calculate the similarity of a word or couple of words compared to a document using a doc2vec model?

In gensim I have a trained doc2vec model. If I have a document and either a single word or two or three words, what would be the best way to calculate the similarity of the words to the document?
Do I just do the standard cosine similarity between them as if they were 2 documents? Or is there a better approach for comparing small strings to documents?
My first thought was to take the cosine similarity between each word in the 1-3 word string and every word in the document and average the results, but I don't know how effective this would be.
There's a number of possible approaches, and what's best will likely depend on the kind/quality of your training data and ultimate goals.
With any Doc2Vec model, you can infer a vector for a new text that contains known words – even a single-word text – via the infer_vector() method. However, like Doc2Vec in general, this tends to work better with documents of at least dozens, and preferably hundreds, of words. (Tiny 1-3 word documents seem especially likely to get somewhat peculiar/extreme inferred-vectors, especially if the model/training-data was underpowered to begin with.)
Beware that unknown words are ignored by infer_vector(), so if you feed it a 3-word document for which two words are unknown, it's really just inferring based on the one known word. And if you feed it only unknown words, it will return a random, mild initialization vector that's undergone no inference tuning. (All inference/training always starts with such a random vector, and if there are no known words, you just get that back.)
Still, this may be worth trying, and you can directly compare via cosine-similarity the inferred vectors from tiny and giant documents alike.
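A minimal sketch of that approach (gensim 4.x attribute names, a hypothetical model path, and a made-up short query; in gensim versions before 4.0 use model.docvecs instead of model.dv):

from gensim.models import Doc2Vec

model = Doc2Vec.load("my_doc2vec.model")                 # hypothetical saved model
short_query = ["auto", "insurance"]                      # tiny 1-3 word "document"

query_vec = model.infer_vector(short_query, epochs=50)   # extra epochs can help tiny texts
# rank training documents by cosine similarity to the inferred vector
print(model.dv.most_similar([query_vec], topn=5))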
Many Doc2Vec modes train both doc-vectors and compatible word-vectors. The default PV-DM mode (dm=1) does this, or PV-DBOW (dm=0) if you add the optional interleaved word-vector training (dbow_words=1). (If you use dm=0, dbow_words=0, you'll get fast training, and often quite-good doc-vectors, but the word-vectors won't have been trained at all - so you wouldn't want to look up such a model's word-vectors directly for any purposes.)
With such a Doc2Vec model that includes valid word-vectors, you could also analyze your short 1-3 word docs via their individual words' vectors. You might check each word individually against a full document's vector, or use the average of the short document's words against a full document's vector.
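A sketch of the word-vector route (again gensim 4.x names; assumes a mode that trains word-vectors, such as the default PV-DM, and a hypothetical document tag 'doc_tag'):

import numpy as np
from gensim.models import Doc2Vec

model = Doc2Vec.load("my_doc2vec.model")                 # hypothetical saved PV-DM model
words = [w for w in ["auto", "insurance"] if w in model.wv]
avg_word_vec = np.mean([model.wv[w] for w in words], axis=0)

doc_vec = model.dv["doc_tag"]                            # vector of one full training document
cos = np.dot(avg_word_vec, doc_vec) / (np.linalg.norm(avg_word_vec) * np.linalg.norm(doc_vec))
print(cos)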
Again, which is best will likely depend on other particulars of your need. For example, if the short doc is a query, and you're listing multiple results, it may be the case that query result variety – via showing some hits that are really close to single words in the query, even when not close to the full query – is as valuable to users as documents close to the full query.
Another measure worth looking at is "Word Mover's Distance", which works just with the word-vectors for a text's words, as if they were "piles of meaning" for longer texts. It's a bit like the word-against-every-word approach you entertained – but working to match words with their nearest analogues in a comparison text. It can be quite expensive to calculate (especially on longer texts) – but can sometimes give impressive results in correlating alternate texts that use varied words to similar effect.

Gensim word2vec WMD similarity dictionary

I'm using word2vec on a 1 million abstracts dataset (2 billion words). To find the most similar documents, I use the gensim.similarities.WmdSimilarity class. When trying to retrieve the best match using wmd_similarity_index[query], the calculation spends most of its time building a dictionary. Here is a piece of the log:
2017-08-25 09:45:39,441 : INFO : built Dictionary(127 unique tokens: ['empirical', 'model', 'estimating', 'vertical', 'concentration']...) from 2 documents (total 175 corpus positions)
2017-08-25 09:45:39,445 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
What does this part do? Is it dependent on the query? Is there a way to do these calculations once and for all?
EDIT: training and scoring phases in my code:
Training and saving to disk:
w2v_size = 300
word2vec = gensim.models.Word2Vec(texts, size=w2v_size, window=9, min_count=5, workers=1, sg=1, hs=1, iter=20) # sg=1 means skip gram is used 
word2vec.save(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
corpus_w2v_wmd_index = gensim.similarities.WmdSimilarity(texts, word2vec.wv)
corpus_w2v_wmd_index.save(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
Loading and scoring:
w2v = gensim.models.Word2Vec.load(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
words = [t for t in proc_text if t in w2v.wv]
corpus_w2v_wmd_index = gensim.similarities.docsim.Similarity.load(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
scores_w2v = np.array(corpus_w2v_wmd_index[words])
The "Word Mover's Distance" calculation is relatively expensive – for each pairwise document comparison, it searches for an optimal 'shifting' of semantic positions, and that shifting is itself dependent on the pairwise simple-distances between all words of each compared document.
That is, it involves far more calculation than a simple cosine-distance between two high-dimensional vectors, and it involves more calculation the longer the two documents are.
There isn't much that can be pre-calculated from the corpus texts until the query's words are known. (Each pairwise calculation depends on the query's words, and their simple-distances to each corpus document's words.)
That said, there are some optimizations the gensim WmdSimilarity class doesn't yet do.
The original WMD paper described a quicker calculation that could help eliminate corpus texts that couldn't possibly be in the top-N most-WMD-similar results. Theoretically, the gensim WmdSimilarity could also implement this optimization, and give quicker results, at least when initializing the WmdSimilarity with the num_best parameter. (Without it, every query returns all WMD-similarity-scores, so this optimization wouldn't help.)
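For reference, a minimal sketch of the query path being discussed, using the gensim 3.x-style API from the question (toy tokenised corpus; num_best limits the returned hits, though as noted it does not yet enable the pruning optimisation; WMD in gensim 3.x also requires the pyemd package):

import gensim

texts = [["auto", "insurance", "rates"],
         ["vertical", "concentration", "model"]]           # toy tokenised corpus
w2v = gensim.models.Word2Vec(texts, size=50, min_count=1)  # `size` is `vector_size` in gensim 4
index = gensim.similarities.WmdSimilarity(texts, w2v.wv, num_best=10)

query = ["insurance", "model"]
print(index[query])    # list of (doc_id, similarity) pairs for the top matches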
Also, for now the WmdSimilarity class just calls KeyedVectors.wmdistance(doc1, doc2) for every query-to-corpus-document pair, as raw texts. Thus the pairwise simple-distances from all doc1 words to doc2 words will be recalculated each time, even if many pairs repeat across the corpus. (That is, if 'apple' is in the query and 'orange' is in every corpus doc, it will still calculate the 'apple'-to-'orange' distance repeatedly.)
So, some caching of interim values might help performance. For example, with a query of 1000 words, and a vocabulary of 100,000 words among all corpus documents, the ((1000 * 100,000) / 2) 50 million pairwise word-distances could be precalculated once, using 200MB, then shared by all subsequent WMD-calculations. To add this optimization would require a cooperative refactoring of WmdSimilarity.get_similarities() and KeyedVectors.wmdistance().
Finally, Word2Vec/Doc2Vec applications don't necessarily require or benefit much from stop-word removal or stemming. But because the expense of WMD calculation grows with document and vocabulary size, anything that shrinks effective document sizes could help performance. So various ways of discarding low-value words, or coalescing similar words, may be worth considering when using WMD on large document sets.

Why are almost all cosine similarities positive between word or document vectors in gensim doc2vec?

I have calculated document similarities using Doc2Vec.docvecs.similarity() in gensim. Now, I would either expect the cosine similarities to lie in the range [0.0, 1.0] if gensim used the absolute value of the cosine as the similarity metric, or roughly half of them to be negative if it does not.
However, what I am seeing is that some similarities are negative, but they are very rare – less than 1% of pairwise similarities in my set of 30000 documents.
Why are almost all of the similarities positive?
There's no inherent guarantee in Word2Vec/Doc2Vec that the generated set of vectors is symmetrically distributed around the origin point. They could be disproportionately in some directions, which would yield the results you've seen.
In a few tests I just did on the toy-sized dataset ('lee corpus') used in the bundled gensim docs/notebooks/doc2vec-lee.ipynb notebook, checking the cosine-similarities of all documents against the first document, it vaguely seems that:
using hierarchical-softmax rather than negative sampling (hs=1, negative=0) yields a balance between >0.0 and <0.0 cosine-similarities that is closer to (but not quite) half and half
using a smaller number of negative samples (such as negative=1) yields a more balanced set of results; using a larger number (such as negative=10) yields relatively more >0.0 cosine-similarities
While not conclusive, this is mildly suggestive that the arrangement of vectors may be influenced by the negative parameter. Specifically, typical negative-sampling parameters, such as the default negative=5, mean words will be trained more times as non-targets, than as positive targets. That might push the preponderance of final coordinates in one direction. (More testing on larger datasets and modes, and more analysis of how the model setup could affect final vector positions, would be necessary to have more confidence in this idea.)
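A sketch of that check (gensim 4.x attribute names and a hypothetical saved model; use model.docvecs.vectors_docs in older gensim versions):

import numpy as np
from gensim.models import Doc2Vec

model = Doc2Vec.load("my_doc2vec.model")     # hypothetical saved model
vecs = model.dv.vectors                      # all trained doc-vectors
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sims = unit @ unit[0]                        # cosine similarity of every doc to the first doc
print((sims < 0).mean())                     # near 0.5 would indicate a balanced arrangement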
If for some reason you wanted a more balanced arrangement of vectors, you could consider transforming their positions, post-training.
There's an interesting recent paper in the word2vec space, "All-but-the-Top: Simple and Effective Postprocessing for Word Representations", that found sets of trained word-vectors don't necessarily have a 0-magnitude mean – they're on average in one direction from the origin. And further, this paper reports that subtracting the common mean (to 're-center' the set), and also removing a few other dominant directions, can improve the vectors' usefulness for certain tasks.
Intuitively, I suspect this 'all-but-the-top' transformation might serve to increase the discriminative 'contrast' in the resulting vectors.
A similar process might yield similar benefits for doc-vectors – and would likely make the full set of cosine-similarities, to any doc-vector, more balanced between >0.0 and <0.0 values.
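A minimal sketch of that post-training transformation (a random array stands in for the real doc-vectors, and the number of dominant directions to remove is a hypothetical choice):

import numpy as np
from sklearn.decomposition import PCA

doc_vectors = np.random.rand(30000, 300)           # stand-in for e.g. model.dv.vectors
centered = doc_vectors - doc_vectors.mean(axis=0)  # subtract the common mean ('re-centre')

D = 2                                              # hypothetical number of dominant directions
pca = PCA(n_components=D).fit(centered)
projection = centered @ pca.components_.T          # coordinates along the top-D directions
adjusted = centered - projection @ pca.components_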

Location of the words in text

Python's NLTK package has a dispersion plot function, which shows the locations of chosen words in a text. Is there any numeric measure of such dispersion that can be calculated in Python? E.g. I want to measure whether the word "money" is spread throughout the text or rather concentrated in one chapter.
I believe there are multiple metrics that can be used to give a quantitative measure of what you are describing: the informativeness of a word over a body of text.
Methodology
Since you mention chapter and text as the levels you wish to evaluate, the basic methodology would be the same:
Break a given text into chapters
Evaluate model on chapter and text level
Compare evaluation on chapter and text level
If the comparison is over a threshold you could claim it is meaningful or informative. Other metrics on the two levels could be used depending on the model.
Models
There are a few models that can be used.
Raw counts
Raw counts of words could be used on chapter and text levels. A threshold of percentage could be used to determine a topic as representative of the text.
For example, if num_word_per_chapter/num_all_words_per_chapter > threshold and/or num_word_per_text/num_all_words_text > threshold then you could claim it is representative. This might be a good baseline. It is essentially a bag-of-words like technique.
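A quick sketch of that raw-counts check (the tokenised chapters and the threshold value are made up):

from collections import Counter

chapters = [["money", "bank", "money", "loan"], ["forest", "tree", "river"]]  # toy tokenised chapters
word, threshold = "money", 0.05

for i, chapter in enumerate(chapters):
    share = Counter(chapter)[word] / len(chapter)   # num_word_per_chapter / num_all_words_per_chapter
    print(i, share, share > threshold)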
Vector Space Models
Vector space models are used in Information Retrieval and Distributional Semantics. They usually use sparse vectors of counts or TF-IDF weights. Two vectors are compared with cosine similarity: closer vectors have smaller angles and are considered "more alike".
You could create chapter-term matrices and average cosine similarity metrics for a text body. If the average_cos_sim > threshold you could claim it is more informative of the topic.
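A rough sketch of that comparison (the chapter texts and the threshold are made up):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chapters = ["money and banks and loans", "trees and forests", "money money money"]  # toy chapters
X = TfidfVectorizer().fit_transform(chapters)                # chapter-term matrix
sims = cosine_similarity(X)
avg_cos_sim = sims[np.triu_indices_from(sims, k=1)].mean()   # average over distinct chapter pairs
print(avg_cos_sim, avg_cos_sim > 0.2)                        # 0.2 is a hypothetical threshold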
Examples and Difficulties
Here is a good example of VSM with NLTK. This may be a good place to start for a few tests.
The difficulties I foresee are:
Chapter Splitting
Finding Informative Threshold
Beyond the rough sketches above, I can't give you a more practical, code-based answer at this time, but I hope this gives you some options to start with.

(Text Classification) Handling same words but from different documents [TFIDF]

So I'm making a Python class which calculates the tf-idf weight of each word in a document. Now in my dataset I have 50 documents. Many words occur in more than one of these documents, so the same word feature appears multiple times but with different tf-idf weights. So the question is: how do I sum up all the weights into one single weight?
First, let's get some terminology clear. A term is a word-like unit in a corpus. A token is a term at a particular location in a particular document. There can be multiple tokens that use the same term. For example, in my answer, there are many tokens that use the term "the". But there is only one term for "the".
I think you are a little bit confused. TF-IDF style weighting functions specify how to compute a per-term score from the term's token frequency within a document and its background document frequency across the corpus, for each term in a document. TF-IDF converts a document into a mapping of terms to weights. So more tokens sharing the same term in a document will increase the corresponding weight for the term, but there will only be one weight per term. There is no separate score for tokens sharing a term inside the doc.
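A small sketch of the point above, using scikit-learn's TfidfVectorizer rather than a hand-rolled class (toy documents; get_feature_names_out() requires a recent scikit-learn):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat"]   # toy documents
vect = TfidfVectorizer()
X = vect.fit_transform(docs)                       # shape: (n_documents, n_terms)
terms = vect.get_feature_names_out()

# One weight per term per document: the repeated tokens of "the" in document 0
# raise that single weight, they do not create separate features.
print(dict(zip(terms, X[0].toarray().ravel())))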
