I am trying to parse the definitions of a target English word from en.wiktionary.org.
I have already looked at an existing module (https://github.com/Suyash458/WiktionaryParser/blob/master/readme.md); however, it also parses things that are redundant for my purpose, such as etymology, related words, pronunciation and examples.
How could I parse only the definitions, organized by part of speech?
Any recommendation or advice would be appreciated.
Is this what you mean?
>>> from wiktionaryparser import WiktionaryParser
>>> parser = WiktionaryParser()
>>> word = parser.fetch('satiate', 'english')
>>> for item in word[0]['definitions']:
... item['partOfSpeech'], item['text']
...
('verb', 'satiate (third-person singular simple present satiates, present participle satiating, simple past and past participle satiated)\n(transitive) To fill to satisfaction; to satisfy.Nothing seemed to satiate her desire for knowledge.\n(transitive) To satisfy to excess. To fill to satiety.\n')
('adjective', "satiate (comparative more satiate, superlative most satiate)\nFilled to satisfaction or to excess.Alexander PopeOur generals now, retir'd to their estates,Hang their old trophies o'er the garden gates;In life's cool evening satiate of applause […]\nAlexander PopeOur generals now, retir'd to their estates,Hang their old trophies o'er the garden gates;In life's cool evening satiate of applause […]\n")
>>> word = parser.fetch('arrondissement', 'french')
>>> for item in word[0]['definitions']:
... item['partOfSpeech'], item['text']
...
('noun', 'arrondissement\xa0m (plural arrondissements)\nArrondissement\n(Canada) Arrondissement, a borough (submunicipal administrative division)\n')
When you ask for a word, this library returns a somewhat complicated structure of nested lists and dictionaries. You might just need some practice in manipulating them.
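For example, if all you want is the definition text keyed by part of speech, you can reshape that structure yourself. A minimal sketch, assuming item['text'] is the newline-separated string shown above (in some versions of the library it is a list of strings instead, which the code also handles):
from wiktionaryparser import WiktionaryParser

parser = WiktionaryParser()

def definitions_by_pos(word, language='english'):
    """Collect {part_of_speech: [definition line, ...]} for the first entry of a word."""
    entry = parser.fetch(word, language)[0]
    grouped = {}
    for item in entry['definitions']:
        text = item['text']
        # In the output above 'text' is one newline-separated string; in some
        # versions of the library it is already a list of strings.
        lines = text if isinstance(text, list) else [l for l in text.split('\n') if l.strip()]
        # The first line is the headword/inflection line, so keep only the glosses.
        grouped.setdefault(item['partOfSpeech'], []).extend(lines[1:])
    return grouped

print(definitions_by_pos('satiate'))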
I have the following sentences:
sent_1 = 'The cat caught the mouse.'
sent_2 = 'The cat caught and killed the mouse.'
Now I want to know who did what to whom. spaCy's noun_chunks work perfectly in the first case, indicating "The cat" as the "nsubj" with chunk.root.head.text being "caught". Likewise, "the mouse" is correctly classified as the "dobj", again with "caught" as chunk.root.head.text. So it is easy to match these two.
However, in the second case, the nsubj gets "caught" as its chunk.root.head.text while the dobj gets "killed", whereas they actually belong together. Is there a way to account for these kinds of cases?
In the second case 'killed' is the head of 'the mouse', as it is the word connecting the noun chunk to the rest of the phrase. From the spaCy documentation:
Root text: The original text of the word connecting the noun chunk to the rest of the parse.
https://spacy.io/usage/linguistic-features#noun-chunks
N.b. that link has a very similar example to yours - a sentence with multiple noun chunks with different roots. ('Autonomous cars shift insurance liability toward manufacturers')
To answer your question, if you want 'caught' to be found as the head in both instances, then really what you're asking for is to recursively check the head of the tree for each noun_chunk... something like this:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The cat caught and killed the mouse.')
[x.root.head.head for x in doc.noun_chunks]
which yields:
[caught, caught]
N.b., this works for your example, but if you need to handle arbitrary sentences then you'd need to do something a bit more sophisticated, i.e. actually recurse the tree, e.g.:
def get_head(x):
    return x.head if x.head.head == x.head else get_head(x.head)
resulting in:
doc2 = nlp("Autonomous cars shift insurance liability toward manufacturers away from everyday users") # adapted from the spacy example with an additional NC 'everyday users' added
In [17]: [get_head(x.root.head) for x in doc.noun_chunks]
Out[17]: [caught, caught]
In [18]: [get_head(x.root.head) for x in doc2.noun_chunks]
Out[18]: [shift, shift, shift, shift]
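A possibly simpler alternative, if what you want is always the main verb of the clause, is to take the root of the sentence each noun chunk belongs to via Span.sent, a rough sketch:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The cat caught and killed the mouse.')

# Each noun chunk is a Span; Span.sent is the sentence containing it,
# and sent.root is the syntactic root of that sentence.
print([chunk.sent.root for chunk in doc.noun_chunks])
# expected: [caught, caught]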
I want to take a text and find how close certain labels are to each other in it. Basically, the idea is to check whether two persons are fewer than 14 words apart; if they are, we say that they are related.
My naive implementation is working, but only if the person is a single word, because I iterate over words.
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
            'a Knight of the Garter', 'James', 'Lady Burlesdon']
# my naive implementation
ws = text.split()
l = len(ws)
for wi, w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check next x words for any involved person
    x = 14
    for i in range(wi + 1, wi + x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi], ws[i])
Now I would like to upgrade this script to allow for multi-word names such as 'Lady Burlesdon'. I am not entirely sure what the best way to proceed is. Any hints are welcome.
You could first preprocess your text so that all the names in the text are replaced with single-word ids. The ids would have to be strings that you would not expect to appear as ordinary words in the text. As you preprocess the text, you could keep a mapping of ids to names so you know which name corresponds to which id. This would allow you to keep your current algorithm as is.
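A minimal sketch of that preprocessing step, reusing the text and involved variables from the question (the PERSON0, PERSON1, ... id format is just an illustrative choice; any token that cannot collide with a real word will do):
def replace_names(text, names):
    """Replace every multi-word name with a single-token id; return new text and the mapping."""
    id_to_name = {}
    # Replace longer names first so a short name never clobbers part of a longer one.
    for i, name in enumerate(sorted(names, key=len, reverse=True)):
        token = 'PERSON{}'.format(i)
        id_to_name[token] = name
        text = text.replace(name, token)
    return text, id_to_name

new_text, id_to_name = replace_names(text, involved)
# Run the original word loop on new_text.split(), but test membership against
# id_to_name (the ids) instead of involved, and map back to names when printing.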
I was testing the NLTK package's vocabulary. I used the following code and was hoping to see all True.
import nltk
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
print ('answered' in english_vocab)
print ('unanswered' in english_vocab)
print ('altered' in english_vocab)
print ('alter' in english_vocab)
print ('looks' in english_vocab)
print ('look' in english_vocab)
But my results are as follows: many words are missing, or rather, some forms of the words are missing. Am I missing something?
False
True
False
True
False
True
Indeed, that corpus is not an exhaustive list of all English word forms. A more appropriate way of telling whether a word is a valid English word is to use WordNet:
from nltk.corpus import wordnet as wn

print(wn.synsets('answered'))
# [Synset('answer.v.01'), Synset('answer.v.02'), Synset('answer.v.03'), Synset('answer.v.04'), Synset('answer.v.05'), Synset('answer.v.06'), Synset('suffice.v.01'), Synset('answer.v.08'), Synset('answer.v.09'), Synset('answer.v.10')]

print(wn.synsets('unanswered'))
# [Synset('unanswered.s.01')]

print(wn.synsets('notaword'))
# []
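If you only need a yes/no answer, you can wrap that in a small helper (a sketch; note that WordNet covers content words, so function words such as 'the' will come back False):
from nltk.corpus import wordnet as wn

def is_known_word(word):
    """True if WordNet has at least one synset for the word (inflected forms are lemmatized)."""
    return len(wn.synsets(word)) > 0

print(is_known_word('answered'))  # True
print(is_known_word('notaword'))  # False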
NLTK corpora do not actually store every word; they are defined as "a large body of text".
For example, you were using the words corpus, and we can check its definition by using its readme() method:
>>> print(nltk.corpus.words.readme())
Wordlists
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
Unix's words is not exhaustive, so it may indeed be missing some words. Corpora are, by their nature, incomplete (hence the emphasis on natural language).
That being said, you might want to try a larger corpus of edited text, like brown:
>>> print(nltk.corpus.brown.readme())
BROWN CORPUS
A Standard Corpus of Present-Day Edited American English, for use with Digital Computers.
by W. N. Francis and H. Kucera (1964)
Department of Linguistics, Brown University
Providence, Rhode Island, USA
Revised 1971, Revised and Amplified 1979
http://www.hit.uib.no/icame/brown/bcm.html
Distributed with the permission of the copyright holder, redistribution permitted.
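A quick way to repeat the original membership test against brown, which, being running text, contains many inflected forms (a sketch; it assumes the corpus has been downloaded with nltk.download('brown'), and it can of course still miss rarer words):
import nltk

# Build a lowercase vocabulary from the Brown corpus tokens.
brown_vocab = set(w.lower() for w in nltk.corpus.brown.words())

for w in ['answered', 'unanswered', 'altered', 'alter', 'looks', 'look']:
    print(w, w in brown_vocab)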
I have a bunch of unrelated paragraphs, and I need to traverse them to find similar occurrences, such that, given a search where I look for object falls, I get a boolean True for text containing:
Box fell from shelf
Bulb shattered on the ground
A piece of plaster fell from the ceiling
And False for:
The blame fell on Sarah
The temperature fell abruptly
I am able to use nltk to tokenise, tag and get WordNet synsets, but I am finding it hard to figure out how to fit nltk's moving parts together to achieve the desired result. Should I chunk before looking for synsets? Should I write a context-free grammar? Is there a best practice when translating from treebank tags to WordNet grammar tags? None of this is explained in the nltk book, and I couldn't find it in the nltk cookbook either.
Bonus points for answers that include pandas in the answer.
[ EDIT ]:
Some code to get things started
In [1]:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series
def tag(x):
    return pos_tag(word_tokenize(x))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly']
ser = Series(phrases)
ser.map(tag)
Out[1]:
0 [(Box, NNP), (fell, VBD), (from, IN), (shelf, ...
1 [(Bulb, NNP), (shattered, VBD), (on, IN), (the...
2 [(A, DT), (piece, NN), (of, IN), (plaster, NN)...
3 [(The, DT), (blame, NN), (fell, VBD), (on, IN)...
4 [(Berlin, NNP), (fell, VBD), (on, IN), (May, N...
5 [(The, DT), (temperature, NN), (fell, VBD), (a...
dtype: object
The way I would do it is the following:
Use nltk to find nouns followed by one or two verbs. In order to match your exact specifications I would use Wordnet:
The only nouns (NN, NNP, PRP, NNS) that should be found are the ones that are in a semantic relation with "physical" or "material" and the only verbs (VB, VBZ, VBD, etc...) that should be found are the ones that are in a semantic relation with "fall".
I mentioned "one or two verbs" because a verb can be preceded by an auxiliary. What you could also do is create a dependency tree to spot subject-verb relations, but it does not seem to be necessary in this case.
You might also want to make sure you exclude location names and keep person names (Because you would accept "John has fallen" but not "Berlin has fallen"). This can also be done with Wordnet, locations have the tag 'noun.location'.
I am not sure in which context you would have to convert the tags, so I cannot provide a proper answer to that; it seems to me that you might not need it in this case: you use the POS tags to identify nouns and verbs, and then you check whether each noun and verb belongs to the relevant synsets.
Hope this helps.
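For the location/person distinction in particular, every WordNet synset carries a lexicographer file name such as 'noun.location' or 'noun.person', so a rough filter could look like this (a sketch; words that are missing from WordNet simply fall through as False):
from nltk.corpus import wordnet as wn

def is_location(word):
    """Rough check: does any noun synset of the word live in the noun.location file?"""
    return any(s.lexname() == 'noun.location' for s in wn.synsets(word, pos=wn.NOUN))

print(is_location('Berlin'))  # True: Berlin has a noun.location synset
print(is_location('box'))     # False: box synsets are artifacts, not locations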
Not perfect, but most of the work is there. Now on to hardcoding pronouns (such as 'it') and closed-class words and adding multiple targets to handle things like 'shattered'. Not a single-liner, but not an impossible task!
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series, DataFrame
import collections.abc
from nltk.corpus import wordnet as wn

def tag(x):
    return pos_tag(word_tokenize(x))

def flatten(l):
    for el in l:
        if isinstance(el, collections.abc.Iterable) and not isinstance(el, str):
            for sub in flatten(el):
                yield sub
        else:
            yield el

def noun_verb_match(phrase, nouns, verbs):
    res = []
    for i in range(len(phrase) - 1):
        if (phrase[i][1] in nouns) & \
           (phrase[i + 1][1] in verbs):
            res.append((phrase[i], phrase[i + 1]))
    return res

def hypernym_paths(word, pos):
    res = [x.hypernym_paths() for x in wn.synsets(word, pos)]
    return set(flatten(res))

def bool_syn(double, noun_syn, verb_syn):
    """
    Returns True if the noun/verb double contains the target WordNet synsets.
    Arguments:
        double: ((noun, tag), (verb, tag))
        noun_syn, verb_syn: WordNet synset strings (e.g., 'travel.v.01')
    """
    noun = double[0][0]
    verb = double[1][0]
    noun_bool = wn.synset(noun_syn) in hypernym_paths(noun, 'n')
    verb_bool = wn.synset(verb_syn) in hypernym_paths(verb, 'v')
    return noun_bool & verb_bool

def bool_loop(l, f):
    """
    Tests all list elements and returns True if any is True.
    Arguments:
        l: List.
        f: Function returning a boolean.
    """
    if len(l) == 0:
        return False
    else:
        return f(l[0]) | bool_loop(l[1:], f)

def bool_noun_verb(series, nouns, verbs, noun_synset_target, verb_synset_target):
    tagged = series.map(tag)
    nvm = lambda x: noun_verb_match(x, nouns, verbs)
    matches = tagged.apply(nvm)
    bs = lambda x: bool_syn(x, noun_synset_target, verb_synset_target)
    return matches.apply(lambda x: bool_loop(x, bs))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly',
           'It fell on the floor']

nouns = "NN NNP PRP NNS".split()
verbs = "VB VBD VBZ".split()
noun_synset_target = 'artifact.n.01'
verb_synset_target = 'travel.v.01'

df = DataFrame()
df['text'] = Series(phrases)
df['fall'] = bool_noun_verb(df.text, nouns, verbs, noun_synset_target, verb_synset_target)
df
I've come up with the below. I've narrowed down the problem to the inability to capture both 1-word and 2-word proper nouns.
(1) It would be great if I could put in a condition that defaults to the longer match when given a choice between two captures.
AND
(2) if I could tell the regex to only consider matches where the string starts with a preposition, such as On|At|For. I was playing around with something like this, but it isn't working:
(^On|^at)([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,5})
How would I do 1 and 2?
my current regex
r'([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,15})'
I'd like to capture Ashoka, Shift Series, Compass Partners, and Kenneth Cole from:
#'On its 25th anniversary, Ashoka',
#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
What you're trying to do here is called "named entity recognition" in natural language processing. If you really want an approach that will find proper nouns, then you may have to consider stepping up to named entity recognition. Thankfully there are some easy-to-use functions in the nltk library:
import nltk
s2 = 'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole'
tokens2 = nltk.word_tokenize(s2)
tags = nltk.pos_tag(tokens2)
res = nltk.ne_chunk(tags)
Results:
res.productions()
Out[8]:
[S -> ('at', 'IN') ('the', 'DT') ORGANIZATION ('national', 'JJ') ('conference', 'NN') (',', ',') ORGANIZATION ('and', 'CC') ('fashion', 'NN') ('designer', 'NN') PERSON,
ORGANIZATION -> ('Shift', 'NNP') ('Series', 'NNP'),
ORGANIZATION -> ('Compass', 'NNP') ('Partners', 'NNPS'),
PERSON -> ('Kenneth', 'NNP') ('Cole', 'NNP')]
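If you only need the entity strings rather than the whole tree, you can walk its subtrees and join the leaves (a sketch continuing from the code above; which labels to keep is up to you):
entities = [' '.join(token for token, pos in subtree.leaves())
            for subtree in res.subtrees()
            if subtree.label() in ('PERSON', 'ORGANIZATION', 'GPE')]
print(entities)
# expected, given the productions above: ['Shift Series', 'Compass Partners', 'Kenneth Cole']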
I would use an NLP tool; the most popular for Python seems to be nltk. Regular expressions are really not the right way to go... There's an example on the front page of the nltk site (https://www.nltk.org/), copy-pasted below:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> entities = nltk.chunk.ne_chunk(tagged)
entities now contains your words, POS-tagged with the Penn Treebank tagset and chunked into named entities.
Not entirely correct, but this will match most of what you are looking for, except that it also picks up 'On' as a false positive.
import re
text = """
#'On its 25th anniversary, Ashoka',
#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth
Cole',
"""
proper_noun_regex = r'([A-Z]{1}[a-z]{1,}(\s[A-Z]{1}[a-z]{1,})?)'
p = re.compile(proper_noun_regex)
matches = p.findall(text)
print(matches)
output:
[('On', ''), ('Ashoka', ''), ('Shift Series', ' Series'), ('Compass Partners', ' Partners'), ('Kenneth Cole', ' Cole')]
And then maybe you could implement a filter to go over this list.
def filter_false_positive(unfiltered_matches):
    filtered_matches = []
    black_list = ["an", "on", "in", "foo", "bar"]  # etc
    for match in unfiltered_matches:
        if match.lower() not in black_list:
            filtered_matches.append(match)
    return filtered_matches
or, because Python is cool:
def filter_false_positive(unfiltered_matches):
    black_list = ["an", "on", "in", "foo", "bar"]  # etc
    return [match for match in unfiltered_matches if match.lower() not in black_list]
and you could use it like this:
# CONTINUED FROM THE CODE ABOVE
matches = [i[0] for i in matches]
matches = filter_false_positive(matches)
print(matches)
giving the final output:
['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
The problem of determining whether a word is capitalized because it occurs at the beginning of a sentence or because it is a proper noun is not trivial.
'Kenneth Cole is a brand name.' vs. 'Can I eat something now?' vs. 'An Englishman had tea.'
In this case it is pretty difficult, so without something that can recognize a proper noun by other means (a blacklist, a database, etc.) it won't be so easy. Regex is awesome, but I don't think it can interpret English on a grammatical level in any trivial way...
That being said, good luck!