search similar meaning phrases with nltk - python

I have a bunch of unrelated paragraphs, and I need to traverse them to find similar occurrences such that, given a search where I look for "object falls", I return a boolean True for text containing:
Box fell from shelf
Bulb shattered on the ground
A piece of plaster fell from the ceiling
And False for:
The blame fell on Sarah
The temperature fell abruptly
I am able to use nltk to tokenise, tag and get WordNet synsets, but I am finding it hard to figure out how to fit nltk's moving parts together to achieve the desired result. Should I chunk before looking for synsets? Should I write a context-free grammar? Is there a best practice when translating from treebank tags to WordNet grammar tags? None of this is explained in the nltk book, and I couldn't find it in the nltk cookbook either.
Bonus points for answers that include pandas in the answer.
[ EDIT ]:
Some code to get things started
In [1]:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series

def tag(x):
    return pos_tag(word_tokenize(x))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly']

ser = Series(phrases)
ser.map(tag)
Out[1]:
0 [(Box, NNP), (fell, VBD), (from, IN), (shelf, ...
1 [(Bulb, NNP), (shattered, VBD), (on, IN), (the...
2 [(A, DT), (piece, NN), (of, IN), (plaster, NN)...
3 [(The, DT), (blame, NN), (fell, VBD), (on, IN)...
4 [(Berlin, NNP), (fell, VBD), (on, IN), (May, N...
5 [(The, DT), (temperature, NN), (fell, VBD), (a...
dtype: object

The way I would do it is the following:
Use nltk to find nouns followed by one or two verbs. In order to match your exact specifications, I would use WordNet:
The only nouns (NN, NNP, PRP, NNS) that should be found are the ones that are in a semantic relation with "physical" or "material", and the only verbs (VB, VBZ, VBD, etc.) that should be found are the ones that are in a semantic relation with "fall".
I mentioned "one or two verbs" because a verb can be preceded by an auxiliary. What you could also do is create a dependency tree to spot subject-verb relations, but it does not seem to be necessary in this case.
You might also want to make sure you exclude location names and keep person names (because you would accept "John has fallen" but not "Berlin has fallen"). This can also be done with WordNet: locations carry the lexicographer tag 'noun.location'.
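A minimal sketch of that location check (the helper name and the "any noun sense" heuristic are mine):

from nltk.corpus import wordnet as wn

def has_location_sense(word):
    """Heuristic: True if any noun synset of `word` sits in the
    'noun.location' lexicographer file."""
    return any(ss.lexname() == 'noun.location'
               for ss in wn.synsets(word, pos=wn.NOUN))

print(has_location_sense('berlin'))  # expected True (the city sense)
print(has_location_sense('john'))    # expected False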
I am not sure in which context you would have to convert the tags, so I cannot give a proper answer to that; it seems to me that you might not need it in this case: you use the POS tags to identify nouns and verbs, and then you check whether each noun and verb belongs to the right synset.
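If you do end up needing the conversion, the usual trick is to map the first letter of the Penn Treebank tag onto a WordNet POS constant; a minimal sketch (the helper name is mine):

from nltk.corpus import wordnet as wn

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS constant (None if no match)."""
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None

print(treebank_to_wordnet('VBD'))  # 'v'
print(treebank_to_wordnet('NNP'))  # 'n'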
Hope this helps.

Not perfect, but most of the work is there. Now on to hardcoding pronouns (such as 'it') and closed-class words and adding multiple targets to handle things like 'shattered'. Not a single-liner, but not an impossible task!
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series, DataFrame
from collections.abc import Iterable
from nltk.corpus import wordnet as wn

def tag(x):
    return pos_tag(word_tokenize(x))

def flatten(l):
    # recursively flatten nested lists, leaving strings and synsets intact
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, str):
            for sub in flatten(el):
                yield sub
        else:
            yield el

def noun_verb_match(phrase, nouns, verbs):
    # collect (noun, verb) pairs where a noun tag is immediately followed by a verb tag
    res = []
    for i in range(len(phrase) - 1):
        if (phrase[i][1] in nouns) & \
           (phrase[i + 1][1] in verbs):
            res.append((phrase[i], phrase[i + 1]))
    return res

def hypernym_paths(word, pos):
    # all synsets appearing on any hypernym path of any sense of `word`
    res = [x.hypernym_paths() for x in wn.synsets(word, pos)]
    return set(flatten(res))

def bool_syn(double, noun_syn, verb_syn):
    """
    Returns True if the noun/verb double matches the target WordNet synsets.
    Arguments:
        double: ((noun, tag), (verb, tag))
        noun_syn, verb_syn: WordNet synset names (e.g., 'travel.v.01')
    """
    noun = double[0][0]
    verb = double[1][0]
    noun_bool = wn.synset(noun_syn) in hypernym_paths(noun, 'n')
    verb_bool = wn.synset(verb_syn) in hypernym_paths(verb, 'v')
    return noun_bool & verb_bool

def bool_loop(l, f):
    """
    Applies f to every list element and returns True if any result is True.
    Arguments:
        l: List of noun/verb doubles.
        f: Function returning a boolean.
    """
    if len(l) == 0:
        return False
    else:
        return f(l[0]) | bool_loop(l[1:], f)

def bool_noun_verb(series, nouns, verbs, noun_synset_target, verb_synset_target):
    tagged = series.map(tag)
    nvm = lambda x: noun_verb_match(x, nouns, verbs)
    matches = tagged.apply(nvm)
    bs = lambda x: bool_syn(x, noun_synset_target, verb_synset_target)
    return matches.apply(lambda x: bool_loop(x, bs))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly',
           'It fell on the floor']
nouns = "NN NNP PRP NNS".split()
verbs = "VB VBD VBZ".split()
noun_synset_target = 'artifact.n.01'
verb_synset_target = 'travel.v.01'

df = DataFrame()
df['text'] = Series(phrases)
df['fall'] = bool_noun_verb(df.text, nouns, verbs, noun_synset_target, verb_synset_target)
df

Related

NER - how to check if a common noun indicates a place (subcategorization)

I am looking for a way to find out whether, in a sentence, a common noun refers to a place. This is easy for proper nouns, but I didn't find any straightforward solution for common nouns.
For example, given the sentence "After a violent and brutal attack, a group of college students travel into the countryside to find refuge from the town they fled, but soon discover that the small village is also home to a coven of serial killers", I would like to mark the following nouns as referring to places: countryside, town, small village, home.
Here is the code I'm using:
import spacy
nlp = spacy.load('en_core_web_lg')
# Process whole documents
text = ("After a violent and brutal attack, a group of college students travel into the countryside to find refuge from the town they fled, but soon discover that the small village is also home to a coven of satanic serial killers")
doc = nlp(text)
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
Which gives as output the following:
Noun phrases: ['a violent and brutal attack', 'a group', 'college students', 'the countryside', 'refuge', 'the town', 'they', 'the small village', 'a coven', 'serial killers']
Verbs: ['travel', 'find', 'flee', 'discover']
You can use WordNet for this.
from nltk.corpus import wordnet as wn

loc = wn.synsets("location")[0]

def is_location(candidate):
    for ss in wn.synsets(candidate):
        # only get those where the synset matches exactly
        name = ss.name().split(".", 1)[0]
        if name != candidate:
            continue
        hit = loc.lowest_common_hypernyms(ss)
        if hit and hit[0] == loc:
            return True
    return False

# true things
for word in ("countryside", "town", "village", "home"):
    print(is_location(word), word, sep="\t")

# false things
for word in ("cat", "dog", "fish", "cabbage", "knife"):
    print(is_location(word), word, sep="\t")
Note that sometimes the synsets are wonky, so be sure to double-check everything.
Also, for things like "small village", you'll have to pull out the head noun, but it'll just be the last word.
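If you are already using spaCy's noun chunks (as in the question), the chunk's .root token gives you that head noun directly; a minimal sketch (feed chunk.root.text into is_location above):

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("the small village is also home to a coven of serial killers")

for chunk in doc.noun_chunks:
    # chunk.root is the syntactic head of the chunk, e.g. 'village' for 'the small village'
    print(chunk.text, "->", chunk.root.text)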

Wiktionary Parser for Python 3.6 - only for definitions

I am trying to parse the definitions of a target English word from "en.Wiktionary.org".
I have already considered an existing module (https://github.com/Suyash458/WiktionaryParser/blob/master/readme.md); however, it also parses content that is redundant for my purpose, such as etymology, related words, pronunciation and examples.
How can I parse only the definitions, according to the part of speech?
Any recommendation or advice would be appreciated.
Is this what you mean?
>>> from wiktionaryparser import WiktionaryParser
>>> parser = WiktionaryParser()
>>> word = parser.fetch('satiate', 'english')
>>> for item in word[0]['definitions']:
... item['partOfSpeech'], item['text']
...
('verb', 'satiate (third-person singular simple present satiates, present participle satiating, simple past and past participle satiated)\n(transitive) To fill to satisfaction; to satisfy.Nothing seemed to satiate her desire for knowledge.\n(transitive) To satisfy to excess. To fill to satiety.\n')
('adjective', "satiate (comparative more satiate, superlative most satiate)\nFilled to satisfaction or to excess.Alexander PopeOur generals now, retir'd to their estates,Hang their old trophies o'er the garden gates;In life's cool evening satiate of applause […]\nAlexander PopeOur generals now, retir'd to their estates,Hang their old trophies o'er the garden gates;In life's cool evening satiate of applause […]\n")
>>> word = parser.fetch('arrondissement', 'french')
>>> for item in word[0]['definitions']:
... item['partOfSpeech'], item['text']
...
('noun', 'arrondissement\xa0m (plural arrondissements)\nArrondissement\n(Canada) Arrondissement, a borough (submunicipal administrative division)\n')
When you ask for a word, this library returns a somewhat complicated structure of lists and dictionaries. You might just need more practice in manipulating them.
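If you only need the definitions for one part of speech, you can filter that structure yourself; a minimal sketch, assuming the same word[0]['definitions'] layout shown above:

from wiktionaryparser import WiktionaryParser

parser = WiktionaryParser()
word = parser.fetch('satiate', 'english')

# keep only the definition text for a chosen part of speech
wanted_pos = 'verb'
definitions = [item['text'] for item in word[0]['definitions']
               if item['partOfSpeech'] == wanted_pos]
print(definitions)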

Missing words in NLTK vocabulary - Python

I was testing the NLTK package's vocabulary. I used the following code and was hoping to see all True.
import nltk
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
print ('answered' in english_vocab)
print ('unanswered' in english_vocab)
print ('altered' in english_vocab)
print ('alter' in english_vocab)
print ('looks' in english_vocab)
print ('look' in english_vocab)
But my results are as follows; many words, or rather some forms of the words, seem to be missing. Am I missing something?
False
True
False
True
False
True
Indeed, the corpus is not an exhaustive list of all English words, but rather a collection of texts. A more appropriate way of telling whether a word is a valid English word is to use WordNet:
from nltk.corpus import wordnet as wn
print(wn.synsets('answered'))
# [Synset('answer.v.01'), Synset('answer.v.02'), Synset('answer.v.03'), Synset('answer.v.04'), Synset('answer.v.05'), Synset('answer.v.06'), Synset('suffice.v.01'), Synset('answer.v.08'), Synset('answer.v.09'), Synset('answer.v.10')]
print(wn.synsets('unanswered'))
# [Synset('unanswered.s.01')]
print(wn.synsets('notaword'))
# []
NLTK corpora do not actually store every word; they are defined as "a large body of text".
For example, you were using the words corpus, and we can check its definition by using its readme() method:
>>> print(nltk.corpus.words.readme())
Wordlists
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
Unix's words is not exhaustive, so it may indeed be missing some words. Corpora are, by their nature, incomplete (hence the emphasis on natural language).
That being said, you might want to try a corpus of running text, like brown, which contains inflected forms as they actually occur:
>>> print(nltk.corpus.brown.readme())
BROWN CORPUS
A Standard Corpus of Present-Day Edited American English, for use with Digital Computers.
by W. N. Francis and H. Kucera (1964)
Department of Linguistics, Brown University
Providence, Rhode Island, USA
Revised 1971, Revised and Amplified 1979
http://www.hit.uib.no/icame/brown/bcm.html
Distributed with the permission of the copyright holder, redistribution permitted.
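For a quick membership check against brown, you can build a vocabulary set the same way as with the words corpus; a minimal sketch (brown only contains word forms that actually occur in its texts, so coverage is still not guaranteed):

import nltk

brown_vocab = set(w.lower() for w in nltk.corpus.brown.words())

for word in ('answered', 'unanswered', 'altered', 'alter', 'looks', 'look'):
    print(word in brown_vocab, word)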

Identify the word as a noun, verb or adjective

Given a single word such as "table", I want to identify what it is most commonly used as: a noun, a verb or an adjective. I want to do this in Python. Is there anything else besides WordNet? I would rather not use WordNet. Or, if I do use WordNet, how exactly would I do it?
import nltk
text = 'This is a table. We should table this offer. The table is in the center.'
text = nltk.word_tokenize(text)
result = nltk.pos_tag(text)
result = [i for i in result if i[0].lower() == 'table']
print(result) # [('table', 'JJ'), ('table', 'VB'), ('table', 'NN')]
If you have a word out of context and want to know its most common use, you could look at someone else's frequency table (e.g., WordNet), or you can do your own counts: just find a tagged corpus that's large enough for your purposes, and count its instances. If you want to use a free corpus, the NLTK includes the Brown corpus (1 million words). The NLTK also provides methods for working with larger, non-free corpora (e.g., the British National Corpus).
import nltk
from nltk.corpus import brown
table = nltk.FreqDist(t for w, t in brown.tagged_words() if w.lower() == 'table')
print(table.most_common())
[('NN', 147), ('NN-TL', 50), ('VB', 1)]
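If you do want to go the WordNet route mentioned in the question, WordNet ships frequency counts for lemmas (taken from a small tagged sample), which you can aggregate by part of speech; a minimal sketch (the counts are sparse, so treat them as a rough signal only):

from collections import Counter
from nltk.corpus import wordnet as wn

pos_counts = Counter()
for synset in wn.synsets('table'):
    for lemma in synset.lemmas():
        if lemma.name().lower() == 'table':
            # lemma.count() is the lemma's frequency in WordNet's tagged sample
            pos_counts[synset.pos()] += lemma.count()

print(pos_counts.most_common())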

How to find collocations in text, python

How do you find collocations in text?
A collocation is a sequence of words that occurs together unusually often.
NLTK has a built-in function bigrams that returns word pairs.
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
What's left is to find bigrams that occur more often than expected, given the frequency of the individual words. Any ideas how to put that into code?
Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:
>>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
...
>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
There are none in this small segment, but here goes:
>>> text.collocations(num=20)
Building collocations list
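For the BigramCollocationFinder mentioned above, here is a minimal sketch over a real corpus (the corpus slice, frequency filter and scoring measure are arbitrary choices):

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# score bigrams from a slice of the Brown corpus
words = [w.lower() for w in nltk.corpus.brown.words()[:50000]]
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # drop bigrams seen fewer than 3 times
print(finder.nbest(BigramAssocMeasures.pmi, 10))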
Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.
words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)  # advance by one so zip() pairs each word with its successor
count = {}
for bigram in zip(words, words_iter):
    count[bigram] = count.get(bigram, 0) + 1
print(sorted(((c, b) for b, c in count.items()), reverse=True))
(words_iter is introduced to avoid copying the whole list of words, as zip(words, words[1:]) would.)
from collections import Counter

words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)
freq = Counter(zip(words, nextword))
print(freq)
A collocation is a sequence of tokens that is better treated as a single token when parsing; e.g. "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc.) followed by judicious manual editing.
Points that you appear to be ignoring:
(1) the corpus must be rather large ... attempting to get collocations from one sentence as you appear to suggest is pointless.
(2) n can be greater than 2 ... e.g. analysing texts written about 20th-century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung" (see the trigram sketch after these points).
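For point (2), NLTK also has a trigram counterpart to the bigram finder; a minimal sketch using the log-likelihood ratio (corpus slice and measure are arbitrary choices):

import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

words = [w.lower() for w in nltk.corpus.brown.words()[:50000]]
finder = TrigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)
print(finder.nbest(TrigramAssocMeasures.likelihood_ratio, 10))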
What are you actually trying to achieve? What code have you written so far?
I agree with Tim McNamara on using nltk and on the problems with unicode. However, I like the Text class a lot - there is a hack that you can use to get the collocations as a list; I discovered it by looking at the source code. Apparently, whenever you invoke the collocations method it saves the result as an attribute!
import nltk
def tokenize(sentences):
    for sent in nltk.sent_tokenize(sentences.lower()):
        for word in nltk.word_tokenize(sent):
            yield word

text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
text.collocations(num=20)
collocations = [" ".join(el) for el in list(text._collocations)]
Enjoy!
