So I'm making a python class which calculates the tfidf weight of each word in a document. Now in my dataset I have 50 documents. In these documents many words intersect, thus having multiple same word features but with different tfidf weight. So the question is how do I sum up all the weights into one singular weight?
First, let's get some terminology clear. A term is a word-like unit in a corpus. A token is a term at a particular location in a particular document. There can be multiple tokens that use the same term. For example, in my answer, there are many tokens that use the term "the". But there is only one term for "the".
I think you are a little bit confused. TF-IDF style weighting functions specify how to make a per term score out of the term's token frequency in a document and the background token document frequency in the corpus for each term in a document. TF-IDF converts a document into a mapping of terms to weights. So more tokens sharing the same term in a document will increase the corresponding weight for the term, but there will only be one weight per term. There is no separate score for tokens sharing a term inside the doc.
Related
I am trying to get unique words for each topic.
I am using gensim and this is the line that help me to generate my model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary)
But I have repeated words in two different topics, I would like to have different words per topic
You cannot enforce words uniqueness by topic in LDA since each topic is a distribution over all the words in the vocabulary. This distribution measure the probability that words co-occur inside a topic. Thus, nothing ensures that a word won't co-occur with different words in different contexts, which will leads to words represented in different topics.
Let's take an example by considering these two documents:
doc1: The python is a beautiful snake living in the forest.
doc2: Python is a beautiful language used by programmer and data scientist.
In doc1 the word python co-occur with snake, forest and living which might give a good probability for this word to appear in a topic, let's say, about biology.
In doc2, the word python co-occur with language, programmer and data which, in this case, will associate this word in a topic about computer science.
What you can eventually do, is to look for words that have the highest probability in topics in order to achieve what you want.
Words that are grouped into one topic do not mean that they are semantically similar(low distance in space mapped from word2vec). They are just co-occurred more often.
I'm trying to use scikit applied to Natural Language Processing and I'm starting by reading some tutorials. I've found this one http://www.markhneedham.com/blog/2015/02/15/pythonscikit-learn-calculating-tfidf-on-how-i-met-your-mother-transcripts/ which explains how to get tfidf scores from a set of documents.
But I have a question, TF-IDF is supposed to depend from a term, the document of that term and the collection of all documents to be analyzed.
So, for example. In a collection of two documents, A and B, the term 'horse' should get a different TF-IDF score if we compute TF-IDF using document A than the same term but by analyzing term frequency from document B.
How can I compute TF-IDF of a term in respect of a specific document using scikit?
In tutorial wich you mentioned TF-IDF is calculated as:
tfidf_matrix = tf.fit_transform(corpus)
Quote: "if we look at tfidf_matrix we’d expect it to be a 208 x 498254 matrix – one row per episode, one column per phrase".
So, TF-IDF of each phrase is different for each episode (text) in this matrix. As you expected.
Matrix element tfidf_matrix[document,phrase] is TF-IDF value for each particular phrase in particular document of a corpus (all documents).
NLTK package of Python has a function dispersion plot, which shows location of chosen words in text. If there any numeric measure of such dispersion that can be calculated in python? E.g. I want to measure weather the word "money" is spread among the text or rather concentrated in one chapter?
I believe there are multiple metrics that can be used to give a quantitative measure of what you are defining as informativeness of a word over a body of text.
Methodology
Since you mention chapter and text as the levels you wish to evaluate, the basic methodology would be the same:
Break a given text into chapters
Evaluate model on chapter and text level
Compare evaluation on chapter and text level
If the comparison is over a threshold you could claim it is meaningful or informative. Other metrics on the two levels could be used depending on the model.
Models
There are a few models that can be used.
Raw counts
Raw counts of words could be used on chapter and text levels. A threshold of percentage could be used to determine a topic as representative of the text.
For example, if num_word_per_chapter/num_all_words_per_chapter > threshold and/or num_word_per_text/num_all_words_text > threshold then you could claim it is representative. This might be a good baseline. It is essentially a bag-of-words like technique.
Vector Space Models
Vector space models are used in Information Retrieval and Distributional Semantics. They usually used sparse vectors of counts or TF-IDF. Two vectors are compared with cosine similarity. Closer vectors have smaller angles and are considered "more alike".
You could create chapter-term matrices and average cosine similarity metrics for a text body. If the average_cos_sim > threshold you could claim it is more informative of the topic.
Examples and Difficulties
Here is a good example of VSM with NLTK. This may be a good place to start for a few tests.
The difficulties I foresee are:
Chapter Splitting
Finding Informative Threshold
I can't give you a more practical code based answer at this time, but I hope this gives you some options to start with.
I want to make sure I understand what the attributes use_idf and sublinear_tf do in the TfidfVectorizer object. I've been researching this for a few days. I am trying to classify documents with varied length and use currently tf-idf for feature selection.
I believe when use_idf=true the algo normalises the bias against the inherent issue (with TF) where a term that is X times more frequent shouldn't be X times as important.
Utilising the tf*idf formula. Then the sublinear_tf = true instills 1+log(tf) such that it normalises the bias against lengthy documents vs short documents.
I am dealing with an inherently bias towards lengthy documents (most belong to one class), does this normalisation really diminish the bias?
How can I make sure the length of the documents in the corpus are not integrated into the model?
I'm trying to verify that the normalisation is being applied in the model. I am trying to extract the normalizated vectors of the corpora, so I assumed I could just sum up each row of the Tfidfvectorizer matrix. However the sum are greater than 1, I thought a normalized copora would transform all documents to a range between 0-1.
vect = TfidfVectorizer(max_features=20000, strip_accents='unicode',
stop_words=stopwords,analyzer='word', use_idf=True, tokenizer=tokenizer, ngram_range=(1,2),sublinear_tf= True , norm='l2')
tfidf = vect.fit_transform(X_train)
# sum norm l2 documents
vect_sum = tfidf.sum(axis=1)
Neither use_idf nor sublinear_tf deals with document length. And actually your explanation for use_idf "where a term that is X times more frequent shouldn't be X times as important" is more fitting as a description to sublinear_tf as sublinear_tf causes logarithmic increase in Tfidf score compared to the term frequency.
use_idf means to use Inverse Document Frequency, so that terms that appear very frequently to the extent they appear in most document (i.e., a bad indicator) get weighted less compared to terms that appear less frequently but they appear in specific documents only (i.e., a good indicator).
To reduce document length bias, you use normalization (norm in TfidfVectorizer parameter) as you proportionally scale each term's Tfidf score based on total score of that document (simple average for norm=l1, squared average for norm=l2)
By default, TfidfVectorizer already use norm=l2, though, so I'm not sure what is causing the problem you are facing. Perhaps those longer documents indeed contain similar words also? Also classification often depend a lot on the data, so I can't say much here to solve your problem.
References:
TfidfVectorizer documentation
Wikipedia
Stanford Book
use_idf=true (by default) introduces a global component to the term frequency component (local component: individual article). When looking after the similarity of two texts, instead of counting the number of terms that each of them has and compare them, introducing the idf helps categorizing these terms into relevant or not. According to Zipf's law, the "frequency of any word is inversely proportional to its rank". That is, the most common word will appear twice as many times as the second most common word, three times as the third most common word etc. Even after removing stop words, all words are subjected to Zipf's law.
In this sense, imagine you have 5 articles describing a topic of automobiles. In this example the word "auto" will likely to appear in all 5 texts, and therefore will not be a unique identifier of a single text. On the other hand, if only an article describes auto "insurance" while another describes auto "mechanics", these two words ("mechanics" and "insurance") will be a unique identifier of each texts. By using the idf, words that appear less common in a texts ("mechanics" and "insurance" for example) will receive a higher weight. Therefore using an idf does not tackle the bias generated by the length of an article, since is again, a measure of a global component. If you want to reduce the bias generated by length then as you said, using sublinear_tf=True will be a way to solve it since you are transforming the local component (each article).
Hope it helps.
I have list of words, I want to calculate the relatedness of two words by considering their co-occurrences. From a paper I found that it can be calculated using Pearsson chi-square test. Also I found nltk.BigramAssocMeasures.ch_sq() for calculating chi-sqare value.
Can I use this for my needs? How can I find chi-square value using nltk?
Have a look at this blog from Streamhacker, it gives a good explanation with code examples.
One of the best metrics for information gain is chi square. NLTK includes this in the BigramAssocMeasures class in the metrics package. To use it, first we need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score words with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our feature selection function to select only those words that appear in the set. Now each file is classified based on the presence of these high information words.