Implementing idf with nltk - python

Given the sentence: "the quick brown fox jumped over the lazy dog", I would like to get a score of how frequent each word is, based on an NLTK corpus (whichever corpus is most generic/comprehensive)
EDIT:
This question is in relation to this question: python nltk keyword extraction from sentence, where #adi92 suggested using IDF to calculate the 'rareness' of a word. I would like to see what this would look like in practice. The broader problem here is: how do you calculate the rareness of a word's use in the English language? I appreciate that this is a hard problem to solve, but nonetheless NLTK IDF (with something like the Brown or Reuters corpus?) might get us part of the way there.

If you want to know word frequencies you need a table of word frequencies. Words have different frequencies depending on text genre, so the best frequency table might be based on a domain-specific corpus.
If you're just messing around, it's easy enough to pick a corpus at random and count the words: use <corpus>.words() and NLTK's FreqDist, and/or see the NLTK book for details.
But for serious use, don't bother counting words yourself: if you're not interested in a specific domain, grab a large word frequency table. There are gazillions out there (it's evidently the first thing a corpus creator thinks of), and the largest one is probably the "1-gram" tables compiled by Google. You can download them at http://books.google.com/ngrams/datasets
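Since the question asks what this would look like in practice, here is a minimal sketch against the Brown corpus. Treating each Brown file as a "document" for the document-frequency count, and the add-one smoothing, are my own assumptions rather than anything prescribed above:
import math
from nltk.corpus import brown   # requires a one-off nltk.download('brown')

# Treat each Brown corpus file as one "document" and record the words it contains.
docs = [set(w.lower() for w in brown.words(fileid)) for fileid in brown.fileids()]
n_docs = len(docs)

def idf(word):
    """Smoothed log(N / df): higher scores mean rarer words."""
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + df))

sentence = "the quick brown fox jumped over the lazy dog"
for word in sentence.split():
    print(word, idf(word))
Content words such as "jumped" and "lazy" end up with noticeably higher scores than grammatical words like "the", which appears in every Brown file.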

Related

How to find text reuse with fuzzy match?

I am trying to efficiently find similarity between a short phrase and a large corpus. For example, suppose my corpus is the book Moby Dick, which has tens of thousands of words.
In addition to that, I have a few short phrases, for example:
phrase1 = "Call me Ishmael" # This is the first sentence in the book exactly.
phrase2 = "Call me Isabel" # This is like the previous with changes of few letters from the third word.
phrase3 = "Call me Is mael" #It's a similar sentence but one word split in two.
In addition, I have of course many other sentences that are not similar to sentences from the book.
I wonder what a generic and efficient way is to identify which phrases have a similar sentence in the book.
What I have tried so far, which seems less than ideal:
I split all the input sentences into 3/4/5/6 n-grams.
I split all the corpus sentences into 3/4/5/6 n-grams.
Then I tried to find an approximate match (with FuzzySet) between all possible combinations of corpus n-grams and input n-grams (the combinations are needed to catch cases where words have been split or merged).
It is quite clear to me that this method is very wasteful and probably also not the most accurate for my needs. I would love to understand how best to do it.
You can use corpus-based spell correction followed by fuzzy matching. For spell correction, you can use a Python implementation of the SymSpell algorithm; here you can find the list of repositories implementing SymSpell. Use SymSpell with its compound configuration.
1. Use a speller trained on the corpus to spell-correct the short phrases.
2. Use fuzzyset/fuzzywuzzy to compute a similarity score between the spell-corrected phrase and each sentence in the corpus.
3. Define a threshold by experiment; if the similarity is above the threshold, call it a match (see the sketch below).
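A hedged sketch of that pipeline, assuming the symspellpy and fuzzywuzzy packages; the file name moby_dick.txt, the edit-distance settings and the threshold of 80 are illustrative choices, not part of the recipe above:
from nltk.tokenize import sent_tokenize
from symspellpy import SymSpell
from fuzzywuzzy import fuzz

# 1. Build a corpus-based speller; the compound mode is the part meant to
#    handle words that have been split or merged.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
sym_spell.create_dictionary("moby_dick.txt")            # hypothetical corpus file

def spell_correct(phrase):
    suggestions = sym_spell.lookup_compound(phrase, max_edit_distance=2)
    return suggestions[0].term if suggestions else phrase

# 2. Fuzzy-match the corrected phrase against every sentence in the corpus.
corpus_sentences = sent_tokenize(open("moby_dick.txt").read())

def best_match(phrase, threshold=80):                   # threshold found by experiment
    corrected = spell_correct(phrase)
    best = max(corpus_sentences, key=lambda s: fuzz.ratio(corrected, s))
    score = fuzz.ratio(corrected, best)
    return (best, score) if score >= threshold else (None, score)

print(best_match("Call me Is mael"))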

Word tokenizer for Python implementation of hedonometer

I'm studying the paper "The emotional arcs of stories are dominated by six basic shapes", in which a hedonometer (previously used mainly for sentiment analysis on Twitter) is applied to a book to produce an emotional arc.
I'm trying to reproduce the results of that paper in Python, but while I understand the hedonometer algorithm, I can't work out how the words were tokenized (the paper says very little on this matter, and I'm not confident about my reading of it).
In the link above, at the bottom, there is the Electronic Supplementary Material, in which, on page S8, it is claimed that the book "The Picture of Dorian Gray" (after removing the front and back matter) has 84,591 words.
I tried two tokenizers (suppose the text is in a variable text).
The first is a naive one
words = text.split()
and gets only 78,934 words.
The second one uses a well-known library (the code might be sub-optimal, since I've never used NLTK before nor done anything related to Natural Language Processing):
from nltk.tokenize import sent_tokenize, word_tokenize
words = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
I get an incredible 95,285 words! If I filter out the punctuation with the following code
words2 = [word for word in words if word not in ',-.:\'\"!?;()[]']
it reduces to 83,141 words.
Yet, nothing really close to the paper's figure of 84,591 words. What am I doing wrong? Am I misunderstanding how a tokenizer works?
P.S. I've tried with other books, but the results are similar: I always either underestimate or overestimate the word count by about 2,000 words.
Bonus question: as the paper points out, there is a dictionary of words with associated happiness scores to be used (to assign an averaged happiness score to chunks of text). That dictionary also contains words such as "can't", while the NLTK tokenizer splits "can't" into "ca" and "n't". Also, looking at the tokenized text on page S5 of the Electronic Supplementary Material, I assume the NLTK behaviour is the desired one. Yet shouldn't it work the other way, i.e. leave "can't" as a single token, given how the dictionary is built?
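For reference, a tiny illustration of the contraction behaviour asked about in the bonus question; the RegexpTokenizer pattern is just one possible workaround I'm assuming, not the paper's tokenizer:
from nltk.tokenize import word_tokenize, RegexpTokenizer

print(word_tokenize("I can't complain"))
# ['I', 'ca', "n't", 'complain']: the split described above

# One possible workaround: a regexp tokenizer that keeps apostrophes inside words.
print(RegexpTokenizer(r"[\w']+").tokenize("I can't complain"))
# ['I', "can't", 'complain']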

Extracting only meaningful text from webpages

I am getting a list of URLs and scraping them using NLTK. My end result is a list of all the words on the webpage. The trouble is that I am only looking for keywords and phrases that are not the usual English "sugar" words such as "as, and, like, to, am, for", etc. I know I can construct a file with all common English words and simply remove them from my scraped token list, but is there a built-in feature of some library that does this automatically?
I am essentially looking for useful words on a page that are not fluff and that can give some context to what the page is about. Almost like the tags on Stack Overflow or the tags Google uses for SEO.
I think what you are looking for is the stopwords.words from nltk.corpus:
>>> from nltk.corpus import stopwords
>>> sw = set(stopwords.words('english'))
>>> sentence = "a long sentence that contains a for instance"
>>> [w for w in sentence.split() if w not in sw]
['long', 'sentence', 'contains', 'instance']
Edit: searching for stopword gives possible duplicates: Stopword removal with NLTK and How to remove stop words using nltk or python. See the answers to those questions. And consider Effects of Stemming on the term frequency? too.
While you can get robust lists of stop words in NLTK (and elsewhere), you can easily build your own lists according to the kind of data (register) you process. Most of the words you do not want are so-called grammatical words: they are extremely frequent, so you catch them easily by sorting a frequency list in descending order and discarding the top n items.
In my experience, the first 100 ranks of any moderately large corpus (>10k tokens of running text) hardly contain any content words.
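A rough sketch of that approach, using the Brown corpus as a stand-in for your own scraped text; the cutoff of 100 ranks is the rule of thumb above, not a tuned value:
import nltk
from nltk.corpus import brown   # stand-in for your own scraped text

freq = nltk.FreqDist(w.lower() for w in brown.words())
my_stopwords = {w for w, _ in freq.most_common(100)}   # the top ranks are almost all grammatical words
print(sorted(my_stopwords))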
It seems that you are interested in extracting keywords, however. For this task, pure frequency signatures are not very useful. You will need to transform the frequencies into some other value with respect to a reference corpus: this is called weighting, and there are many different ways to achieve it. Tf-idf has been the industry standard since 1972.
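To make the tf-idf idea concrete, here is a hedged sketch using scikit-learn's TfidfVectorizer; the choice of library and the placeholder documents are my assumptions, not something this answer prescribes:
from sklearn.feature_extraction.text import TfidfVectorizer

reference_docs = ["text of some other page ...", "text of yet another page ..."]
page_text = "text scraped from the page you want keywords for ..."

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(reference_docs + [page_text])

# The last row is our page; its highest-weighted terms are the keyword candidates.
terms = vectorizer.get_feature_names_out()   # get_feature_names() on older scikit-learn
weights = matrix.toarray()[-1]
print(sorted(zip(weights, terms), reverse=True)[:10])
The reference documents play the role of the reference corpus mentioned above: a term scores highly only if it is frequent on your page but not common across the reference set.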
If you are going to spend time doing these tasks, get an introductory handbook for corpus linguistics or computational linguistics.
You can look for available corpus linguistics resources for data on word frequency (along with other annotations).
You can start from links on wikipedia: http://en.wikipedia.org/wiki/Corpus_linguistics#External_links
You can probably find more information at https://linguistics.stackexchange.com/

How to use Parts-of-Speech to evaluate semantic text similarity?

I'm trying to write a program to evaluate semantic similarity between texts. I have already compared n-gram frequencies between texts (a lexical measure). I wanted something a bit less shallow than this, and I figured that looking at similarity in sentence construction would be one way to evaluate text similarity.
However, all I can figure out how to do is count the POS tags (for example, 4 nouns per text, 2 verbs, etc.). This is then similar to just counting n-grams (and actually works less well than the n-grams).
import nltk
from collections import Counter

postags = nltk.pos_tag(tokens)
self.pos_freq_dist = Counter(tag for word, tag in postags)
for pos, freq in self.pos_freq_dist.items():  # iteritems() in Python 2
    self.pos_freq_dist_relative[pos] = freq / self.token_count  # normalise POS freq by token count
Lots of people (Pearson, ETS Research, IBM, academics, etc.) use Parts-of-Speech for deeper measures, but no one says how they have done it. How can Parts-of-Speech be used for a 'deeper' measure of semantic text similarity?
A more sophisticated tagger is required, such as the one described at http://phpir.com/part-of-speech-tagging/.
You will need to write algorithms and create word banks to determine the meaning or intention of sentences. Semantic analysis is artificial intelligence.
Nouns and capitalized nouns will be the subjects of the content. Adjectives will give some hint as to the polarity of the content. Vagueness, clarity, power, weakness, the types of words used: the possibilities are endless.
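As a minimal sketch of that idea, pulling out the nouns and adjectives and comparing their overlap; the Jaccard-style measure is an illustrative choice on my part, not a prescription:
import nltk

def content_words(text):
    """Nouns and adjectives: the words that carry the subject and polarity of the content."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return {w.lower() for w, tag in tagged if tag.startswith(("NN", "JJ"))}

def overlap(text_a, text_b):
    a, b = content_words(text_a), content_words(text_b)
    return len(a & b) / len(a | b) if a | b else 0.0

print(overlap("The quick brown fox jumps over the lazy dog.",
              "A fast brown fox leapt over a sleepy dog."))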
Take a look at chapter 6 of the NLTK Book. It should give you plenty of ideas for features you can use to classify text.

Using WordNet to determine semantic similarity between two texts?

How can you determine the semantic similarity between two texts in python using WordNet?
The obvious preprocessing would be removing stop words and stemming, but then what?
The only way I can think of would be to calculate the WordNet path distance between each word in the two texts. This is standard for unigrams. But these are large (400-word) texts that are natural language documents, with words that are not in any particular order or structure (other than those imposed by English grammar). So, which words would you compare between texts? How would you do this in Python?
One thing that you can do is:
Kill the stop words
Find as many words as possible that have maximal intersections of synonyms and antonyms with those of other words in the same doc. Let's call these "the important words".
Check how close the sets of important words of the two documents are. The closer they are, the more semantically similar your documents (a rough sketch of this recipe follows below).
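A rough sketch of the recipe; whitespace tokenization, the top-20 cutoff and the Jaccard overlap are my own assumptions, not part of the steps above:
from nltk.corpus import stopwords, wordnet as wn

def related_forms(word):
    """Synonyms and antonyms of a word, collected from all of its synsets."""
    forms = set()
    for syn in wn.synsets(word):
        for lemma in syn.lemmas():
            forms.add(lemma.name())
            forms.update(ant.name() for ant in lemma.antonyms())
    return forms

def important_words(text, top_n=20):
    sw = set(stopwords.words("english"))
    words = {w.lower() for w in text.split() if w.lower() not in sw}
    forms = {w: related_forms(w) for w in words}
    # Score each word by how many other words in the same doc share a synonym/antonym with it.
    score = {w: sum(1 for v in words if v != w and forms[w] & forms[v]) for w in words}
    return set(sorted(score, key=score.get, reverse=True)[:top_n])

def doc_similarity(text_a, text_b):
    a, b = important_words(text_a), important_words(text_b)
    return len(a & b) / len(a | b) if a | b else 0.0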
There is another way: compute sentence trees out of the sentences in each doc, then compare the two forests. I did some similar work for a course a long time ago. Here's the code (keep in mind this was a long time ago and it was for a class, so the code is extremely hacky, to say the least).
Hope this helps
