Extracting only meaningful text from webpages - python

I am getting a list of URLs and scraping them using nltk. My end result is a list of all the words on the webpage. The trouble is that I am only looking for keywords and phrases that are not the usual English "sugar" words such as "as, and, like, to, am, for" and so on. I know I could build a file of all the common English words and simply remove them from my scraped token list, but is there a built-in feature of some library that does this automatically?
I am essentially looking for useful words on a page that are not fluff and can give some context to what the page is about, almost like the tags on Stack Overflow or the tags Google uses for SEO.

I think what you are looking for is the stopwords.words from nltk.corpus:
>>> from nltk.corpus import stopwords
>>> sw = set(stopwords.words('english'))
>>> sentence = "a long sentence that contains a for instance"
>>> [w for w in sentence.split() if w not in sw]
['long', 'sentence', 'contains', 'instance']
Edit: searching for "stopword" turns up possible duplicates: Stopword removal with NLTK and How to remove stop words using nltk or python. See the answers to those questions, and consider Effects of Stemming on the term frequency? too.
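Applied to a scraped token list, the filtering might look like the sketch below; word_tokenize, the lowercasing and the isalpha() check are additions for illustration, not part of the answer above.
# Minimal sketch: keep only lowercased, alphabetic, non-stopword tokens.
from nltk import word_tokenize
from nltk.corpus import stopwords

sw = set(stopwords.words('english'))

def content_words(text):
    """Return lowercased alphabetic tokens that are not stopwords."""
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in sw]

print(content_words("This page is about web scraping and keyword extraction."))
# -> ['page', 'web', 'scraping', 'keyword', 'extraction']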

While you might get robust lists of stop-words in NLTK (and elsewhere), you can easily build your own lists tailored to the kind of data (register) you process. Most of the words you do not want are so-called grammatical words: they are extremely frequent, so you catch them easily by sorting a frequency list in descending order and discarding the top n items.
In my experience, the first 100 ranks of any moderately large corpus (>10k tokens of running text) hardly contain any content words.
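A minimal sketch of that idea, using NLTK's FreqDist with the Brown corpus standing in for "your own data"; the cutoff of 100 follows the rule of thumb above, everything else is an assumption:
# Sketch: derive a stoplist from the top frequency ranks of a corpus.
import nltk
from nltk.corpus import brown

freq = nltk.FreqDist(w.lower() for w in brown.words() if w.isalpha())
stoplist = {w for w, _ in freq.most_common(100)}   # top 100 ranks ~ grammatical words

tokens = ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
print([t for t in tokens if t not in stoplist])    # grammatical words such as "the" are dropped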
It seems that you are interested in extracting keywords, however. For this task, pure frequency signatures are not very useful. You will need to transform the frequencies into some other value with respect to a reference corpus: this is called weighting, and there are many different ways to do it. TF-IDF has been the industry standard since 1972.
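A minimal sketch of such weighting, here plain TF-IDF with the files of the Brown corpus serving as the reference document collection; the +1 smoothing and the toy "page" are assumptions:
# Sketch: weight the words of one page by tf * idf against a reference corpus.
import math
from collections import Counter
from nltk.corpus import brown

docs = [set(w.lower() for w in brown.words(fid)) for fid in brown.fileids()]
N = len(docs)

def idf(word):
    df = sum(1 for d in docs if word in d)   # document frequency in the reference corpus
    return math.log(N / (1 + df))            # +1 avoids division by zero

page_tokens = ["wizard", "wand", "the", "school", "the", "magic", "magic"]
tf = Counter(page_tokens)
scores = {w: tf[w] * idf(w) for w in tf}
for w, s in sorted(scores.items(), key=lambda x: -x[1]):
    print(w, round(s, 2))
# rare, topical words ("wand", "wizard", "magic") outrank "the"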
If you are going to spend time doing these tasks, get an introductory handbook for corpus linguistics or computational linguistics.

You can look at available corpus-linguistics resources for data on word frequency (along with other annotations).
You can start from the links on Wikipedia: http://en.wikipedia.org/wiki/Corpus_linguistics#External_links
You can probably find more information at https://linguistics.stackexchange.com/

Related

Would it be very inefficient to have a repository of words to check a single language against?

I am doing some NLP with Python on YouTube comments I have downloaded, and I only want to process the English ones. So far I have experimented with different libraries (many of the ones discussed in this thread); they work fine for longer strings, but many of them run into problems with the shorter, one- or two-word comments. My question is whether it would be hopelessly inefficient to download a dictionary of English words and check each of these short, problematic comments against it, obviously discarding the ones that don't match.
I can foresee problems with things such as misspellings or words that appear in both English and another language, but at present I am more concerned about speed, as I have about 68 million comments to process.
Try using NLTK's corpora. NLTK is an external Python module that ships with multiple corpora for natural language processing. Specifically, what interests you is the following:
from nltk.corpus import words
eng_words = words.words("en")
Words.words("en") is a list containing almost 236,000 English words. By converting this into a set would really speed up your word processing. You could test your words against this corpus, and if they exist it means they are English words:
string = "I loved stack overflow so much. Mary had a little lamb"
set_words = set(words.words("en"))
for word in string.split():
if word in set_words:
print(word)
Output
I loved stack overflow so much. Mary had a little lamb
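For the language-filtering use case in the question, a minimal sketch might look like this; the 50% threshold, the punctuation stripping and the keep_if_english name are assumptions for illustration:
# Sketch: keep a comment only if most of its alphabetic tokens are in the English word list.
from nltk.corpus import words

ENGLISH = set(w.lower() for w in words.words("en"))

def keep_if_english(comment, threshold=0.5):   # 0.5 is an arbitrary cutoff
    tokens = [t.strip(".,!?;:'\"").lower() for t in comment.split()]
    tokens = [t for t in tokens if t.isalpha()]
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in ENGLISH)
    return hits / len(tokens) >= threshold

print(keep_if_english("Mary had a little lamb"))     # likely True
print(keep_if_english("przepraszam nie rozumiem"))   # likely False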
If it is a dictionary you are looking for (with proper definitions), I have used Tushar's implementation. It is neatly made and is available for everyone. The format used is:
{WORD: {'MEANINGS':{} , 'ANTONYMS':[...] , 'SYNONYMS':[...]}}
and the 'MEANINGS' dict is arranged as
'MEANINGS':{sense_num_1:[TYPE_1, MEANING_1, CONTEXT_1, EXAMPLES], sense_num_2:[TYPE_2, MEANING_2, CONTEXT_2, EXAMPLES] and so on...}
The file is available here: https://www.dropbox.com/s/qjdgnf6npiqymgs/data.7z?dl=1
More details can be found here: English JSON Dictionary with word, word type and definition
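A minimal sketch of loading that file and inspecting one entry; the local filename data.json (after extracting the archive) and the exact casing of the keys are assumptions:
# Sketch: load the JSON dictionary described above and inspect one entry.
import json

with open("data.json", encoding="utf-8") as f:
    dictionary = json.load(f)

word = next(iter(dictionary))              # grab some entry to see the structure
entry = dictionary[word]
print(word)
print(entry.get("SYNONYMS", []))
print(entry.get("ANTONYMS", []))
for sense, details in entry.get("MEANINGS", {}).items():
    print(sense, details)                  # [TYPE, MEANING, CONTEXT, EXAMPLES]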

How do I discover the list of words from one corpus that distinguishes it from another corpus? Python

I have two lists of unstructured text input, and I want to find the words that distinguish listA from listB.
For example, if listA were the text of "Harry Potter" and listB were the text of "Ender's Game", the distinguishing elements for listA would be [wand, magic, wizard, . . .] and the distinguishing elements for listB would be [ender, buggers, battle, . . .]
I've tried a bit with the python-nltk module, and am able to easily find the most common words in each list, but that is not exactly what I'm after.
"I've tried a bit with the python-nltk, and am able to easily find the most common words in each list, but not exactly what I'm after"
I'm guessing that what you mean by this is that it comes up with words like "and", "the", "of", etc. as the words with the highest frequency. These words aren't very helpful; they are basically just the glue that holds words together to form a sentence. You can remove them, but you need a list of such "useless" words, called a stoplist. NLTK has one: from nltk.corpus import stopwords.
You might want to take a look at TF-IDF scoring. This will give a higher weight to the words that are common in one document but uncommon in general. Usually you would use a large reference corpus to calculate which words are common in general.
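A minimal sketch of the idea, treating each list as one document and ranking words by how much more frequent they are in one than in the other (a simple smoothed frequency ratio rather than full TF-IDF; the +1 smoothing and the toy texts are assumptions):
# Sketch: rank words by how much more frequent they are in tokens_a than in tokens_b.
from collections import Counter
from nltk.corpus import stopwords

sw = set(stopwords.words('english'))

def distinguishing_words(tokens_a, tokens_b, top=10):
    a = Counter(t.lower() for t in tokens_a if t.isalpha() and t.lower() not in sw)
    b = Counter(t.lower() for t in tokens_b if t.isalpha() and t.lower() not in sw)
    ratio = {w: (a[w] + 1) / (b[w] + 1) for w in a}   # +1 smoothing for unseen words
    return sorted(ratio, key=ratio.get, reverse=True)[:top]

list_a = "the wizard raised his wand and the magic filled the room".split()
list_b = "ender led the battle against the buggers in the simulator".split()
print(distinguishing_words(list_a, list_b))   # topical words from the first list come out on top
print(distinguishing_words(list_b, list_a))   # and vice versa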
You can use synsets to get this done. For synsets, NLTK includes an interface to a very powerful lexical database called WordNet.
WordNet is a big 'database' (for lack of a better word) of human language; it covers not only English but many other languages as well.
A synset is something like the shared idea you get when you hear a term. Almost like a set of synonyms, but not that strict. Please go to the link; it gives a better definition.
Synset closures are what can help you the most. For example, 'bee' is an animal, an insect, a living thing; Harry Potter is fictional, human, a wizard.
from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')         # the first noun sense of "dog"
hyper = lambda s: s.hypernyms()     # one step up the is-a hierarchy
list(dog.closure(hyper))            # all hypernym ancestors: canine, carnivore, ... up to entity
Here's a book that teaches you the surface of NLTK; it's not very good, but it is a good place to start, along with the NLTK HOWTOs.
If you want something deeper I can't help you much; I don't know most of the definitions and functions NLTK provides, but synsets are a great place to start.

Implementing idf with nltk

Given the sentence: "the quick brown fox jumped over the lazy dog", I would like to get a score of how frequent each word is, based on an nltk corpus (whichever corpus is the most generic/comprehensive).
EDIT:
This question is in relation to this question: python nltk keyword extraction from sentence, where @adi92 suggested using idf to calculate the 'rareness' of a word. I would like to see what this would look like in practice. The broader problem here is: how do you calculate the rareness of a word's use in the English language? I appreciate that this is a hard problem to solve, but nonetheless nltk idf (with something like the Brown or Reuters corpus??) might get us part of the way there?
If you want to know word frequencies you need a table of word frequencies. Words have different frequencies depending on text genre, so the best frequency table might be based on a domain-specific corpus.
If you're just messing around, it's easy enough to pick a corpus at random and count the words: use <corpus>.words() and NLTK's FreqDist, and/or see the NLTK book for details.
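A minimal sketch of that counting approach with the Brown corpus; the choice of Brown and the use of relative frequency as an (inverse) rareness signal are assumptions:
# Sketch: relative frequency of each word in the sentence, counted over the Brown corpus.
import nltk
from nltk.corpus import brown

freq = nltk.FreqDist(w.lower() for w in brown.words())
total = freq.N()                           # total number of tokens in the corpus

sentence = "the quick brown fox jumped over the lazy dog"
for word in sentence.split():
    count = freq[word]
    print(word, count, count / total)      # the lower the relative frequency, the rarer the word
# "the" has a very high count; "lazy" and "fox" are far rarer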
But for serious use, don't bother counting words yourself: if you're not interested in a specific domain, grab a large word-frequency table. There are gazillions out there (it's evidently the first thing a corpus creator thinks of), and the largest are probably the "1-gram" tables compiled by Google. You can download them at http://books.google.com/ngrams/datasets

How to use Parts-of-Speech to evaluate semantic text similarity?

I'm trying to write a program to evaluate semantic similarity between texts. I have already compared n-gram frequencies between texts (a lexical measure). I wanted something a bit less shallow than this, and I figured that looking at similarity in sentence construction would be one way to evaluate text similarity.
However, all I can figure out how to do is count the POS tags (for example, 4 nouns per text, 2 verbs, etc.). This is then similar to just counting n-grams (and actually works less well than the n-grams do).
postags = nltk.pos_tag(tokens)
self.pos_freq_dist = Counter(tag for word, tag in postags)
for pos, freq in self.pos_freq_dist.items():                      # items() rather than the Python 2 iteritems()
    self.pos_freq_dist_relative[pos] = freq / self.token_count    # normalise POS freq by token count
Lots of people (Pearson, ETS Research, IBM, academics, etc.) use parts of speech for deeper measures, but no one says how they have done it. How can parts of speech be used for a 'deeper' measure of semantic text similarity?
A more sophisticated tagger is required, such as the one described at http://phpir.com/part-of-speech-tagging/.
You will need to write algorithms and create word banks to determine the meaning or intention of sentences. Semantic analysis is artificial intelligence.
Nouns and capitalized nouns will be the subjects of the content. Adjectives will give some hint as to the polarity of the content. Vagueness, clarity, power, weakness, the types of words used. The possibilities are endless.
Take a look at chapter 6 of the NLTK Book. It should give you plenty of ideas for features you can use to classify text.
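Building on the counting code in the question, one possible feature comparison is to put the normalized POS-tag distributions of two texts side by side and measure their cosine similarity. This is a sketch of my own, not something stated in the answers above:
# Sketch: compare the normalized POS-tag distributions of two texts with cosine similarity.
import math
from collections import Counter
import nltk

def pos_distribution(text):
    tokens = nltk.word_tokenize(text)
    tags = Counter(tag for _, tag in nltk.pos_tag(tokens))
    total = sum(tags.values())
    return {tag: count / total for tag, count in tags.items()}

def cosine(d1, d2):
    dot = sum(d1.get(k, 0) * d2.get(k, 0) for k in set(d1) | set(d2))
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (norm1 * norm2)

a = pos_distribution("The quick brown fox jumped over the lazy dog.")
b = pos_distribution("A small grey cat slept under the old wooden table.")
print(cosine(a, b))   # close to 1.0 for similarly constructed sentences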

Using WordNet to determine semantic similarity between two texts?

How can you determine the semantic similarity between two texts in python using WordNet?
The obvious preprocessing would be removing stop words and stemming, but then what?
The only way I can think of would be to calculate the WordNet path distance between each word in the two texts. This is standard for unigrams. But these are large (400-word) texts: natural-language documents whose words are not in any particular order or structure (other than that imposed by English grammar). So, which words would you compare between the texts? How would you do this in Python?
One thing that you can do is:
1. Kill the stop words.
2. Find as many words as possible that have maximal intersections of synonyms and antonyms with those of other words in the same doc. Let's call these "the important words".
3. Check to see if the set of the important words of each document is the same. The closer they are together, the more semantically similar your documents.
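A minimal sketch combining the important-words idea with the path-distance comparison mentioned in the question: for each content word of one text, take its best path_similarity to any content word of the other, then average. The greedy max-and-average scheme and the use of the first synset of each word are simplifying assumptions:
# Sketch: rough text-to-text similarity via WordNet path_similarity.
from nltk.corpus import stopwords, wordnet as wn

sw = set(stopwords.words('english'))

def content_synsets(text):
    tokens = [w.lower() for w in text.split() if w.isalpha() and w.lower() not in sw]
    synsets = [wn.synsets(w) for w in tokens]
    return [s[0] for s in synsets if s]          # first sense of each word WordNet knows

def text_similarity(text_a, text_b):
    syns_a, syns_b = content_synsets(text_a), content_synsets(text_b)
    best = []
    for sa in syns_a:
        scores = [sa.path_similarity(sb) for sb in syns_b]
        scores = [s for s in scores if s is not None]
        if scores:
            best.append(max(scores))
    return sum(best) / len(best) if best else 0.0

print(text_similarity("the dog chased the cat", "a hound pursued a kitten"))
print(text_similarity("the dog chased the cat", "interest rates rose sharply"))
# the first pair should score noticeably higher than the second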
There is another way: compute sentence trees from the sentences in each doc, then compare the two forests. I did some similar work for a course a long time ago. Here's the code (keep in mind this was a long time ago and it was for a class, so the code is extremely hacky, to say the least).
Hope this helps
