Given two words, find whether they are in the same synset - python

I am fairly new to nltk. I am trying to find a solution to the problem I am currently working on:
Given two words w1 and w2, is there a way to find out whether they belong to the same synset in the WordNet database?
Also, is it possible to find the list of synsets that contain a given word?
Thanks.

Also, is it possible to find the list of synsets that contain a given word?
Yes:
>>> from nltk.corpus import wordnet as wn
>>> auto, car = 'auto', 'car'
>>> wn.synsets(auto)
[Synset('car.n.01')]
>>> wn.synsets(car)
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
If we look at the lemmas in every synset from wn.synsets(car), we'll find that "car" exists as one of the lemmas:
>>> for ss in wn.synsets(car):
...     assert 'car' in ss.lemma_names()
...
>>> for ss in wn.synsets(car):
...     print('car' in ss.lemma_names(), ss.lemma_names())
...
True ['car', 'auto', 'automobile', 'machine', 'motorcar']
True ['car', 'railcar', 'railway_car', 'railroad_car']
True ['car', 'gondola']
True ['car', 'elevator_car']
True ['cable_car', 'car']
Note: a lemma is not exactly a surface word; see Stemmers vs Lemmatizers. You might also find this helpful: https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66 (disclaimer: shameless plug)
Given two words w1 and w2, is there a way to find out whether they belong to the same synset in the WordNet database?
Yes:
>>> from nltk.corpus import wordnet as wn
>>> auto, car = 'auto', 'car'
>>> wn.synsets(auto)
[Synset('car.n.01')]
>>> wn.synsets(car)
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
>>> auto_ss = set(wn.synsets(auto))
>>> car_ss = set(wn.synsets(car))
>>> car_ss.intersection(auto_ss)
{Synset('car.n.01')}
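The intersection check generalizes to a small helper; a minimal sketch (share_synset and in_same_synset are hypothetical names, not NLTK functions):

```python
def share_synset(synsets1, synsets2):
    """Return the synsets common to both lists (empty set if none)."""
    return set(synsets1) & set(synsets2)

def in_same_synset(synsets1, synsets2):
    """True if the two synset lists overlap, i.e. the words co-occur in a synset."""
    return bool(share_synset(synsets1, synsets2))

# Usage with WordNet (assumes nltk and the wordnet corpus are installed):
# from nltk.corpus import wordnet as wn
# share_synset(wn.synsets('auto'), wn.synsets('car'))    # {Synset('car.n.01')}
# in_same_synset(wn.synsets('auto'), wn.synsets('car'))  # True
```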

Related

How to get the gloss given sense key using Nltk WordNet?

I got a set of sense keys such as "long%3:00:02::" from SemCor+OMSTI. How can I get the glosses? Is there a map file? Or can I use NLTK WordNet?
TL;DR
import re
from nltk.corpus import wordnet as wn

sense_key_regex = r"(.*)\%(.*):(.*):(.*):(.*):(.*)"
synset_types = {1:'n', 2:'v', 3:'a', 4:'r', 5:'s'}

def synset_from_sense_key(sense_key):
    lemma, ss_type, lex_num, lex_id, head_word, head_id = re.match(sense_key_regex, sense_key).groups()
    ss_idx = '.'.join([lemma, synset_types[int(ss_type)], lex_id])
    return wn.synset(ss_idx)

x = "long%3:00:02::"
synset_from_sense_key(x)
In Long
There's this really obtuse function in NLTK. However, that doesn't read from the sense key but from data_file_map (e.g. "data.adj", "data.noun", etc.): https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1355
Since we already have a mere-mortal understandable API in NLTK, with some guidance from https://wordnet.princeton.edu/wordnet/man/senseidx.5WN.html:
A sense_key is represented as:
lemma % lex_sense
where lex_sense is encoded as:
ss_type:lex_filenum:lex_id:head_word:head_id
(yada, yada...)
The synset type is encoded as follows:
1 NOUN
2 VERB
3 ADJECTIVE
4 ADVERB
5 ADJECTIVE SATELLITE
we can do this using a regex https://regex101.com/r/9KlVK7/1/:
>>> import re
>>> sense_key_regex = r"(.*)\%(.*):(.*):(.*):(.*):(.*)"
>>> x = "long%3:00:02::"
>>> re.match(sense_key_regex, x)
<_sre.SRE_Match object at 0x10061ad78>
>>> re.match(sense_key_regex, x).groups()
('long', '3', '00', '02', '', '')
>>> lemma, ss_type, lex_num, lex_id, head_word, head_id = re.match(sense_key_regex, x).groups()
>>> synset_types = {1:'n', 2:'v', 3:'a', 4:'r', 5:'s'}
>>> idx = '.'.join([lemma, synset_types[int(ss_type)], lex_id])
>>> idx
'long.a.02'
And voila you get the NLTK Synset() object from the sense key =)
>>> from nltk.corpus import wordnet as wn
>>> wn.synset(idx)
Synset('long.a.02')
I solved this by downloading http://wordnet.princeton.edu/glosstag.shtml and using the files in WordNet-3.0\glosstag\merged to create my own mapping dict.
The first answer can give a wrong result. Also, there are many keys in WordNet for which no synset exists. For these reasons, you can use the following code for WordNet 3.0:
import nltk
from nltk.corpus import wordnet as wn

def synset_from_key(sense_key):
    lem = wn.lemma_from_key(sense_key)
    return lem.synset()

key = 'england%1:15:00::'
try:
    ss = synset_from_key(key)
    print(ss)
except nltk.corpus.reader.wordnet.WordNetError:
    print("No Synset Found.")
You can also find the definition by using:
print(ss.definition())
More details can be found at: https://www.nltk.org/howto/wordnet.html

NLTK WordNet verb hierarchy

I spotted some problems with WordNet's hierarchy for verbs.
For example,
a.lowest_common_hypernyms(wn.synset('love.v.02')) returns [].
Isn't there a common ancestor like entity for verbs as well? Are verbs even connected to nouns in the same hierarchy?
To find the top hypernym of any synset, use the Synset.root_hypernyms() function, e.g.:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('car')[0].root_hypernyms()
[Synset('entity.n.01')]
>>> wn.synsets('love')[0].root_hypernyms()
[Synset('entity.n.01')]
>>> wn.synsets('love', 'v')[0].root_hypernyms()
[Synset('love.v.01')]
It seems that there's no overarching/umbrella hypernym that covers all verbs, unlike nouns, which are all covered by entity.n.01:
>>> root_hypernyms_of_nouns = Counter(chain(*[ss.root_hypernyms() for ss in wn.all_synsets(pos='n')]))
>>> len(root_hypernyms_of_nouns)
1
>>> root_hypernyms_of_nouns.items()
[(Synset('entity.n.01'), 82115)]
But you can try to iterate through all verbs, e.g.:
wn.all_synsets(pos='v')
And try to find the top most hypernyms for verbs (it will be a rather large list):
>>> from collections import Counter
>>> from itertools import chain
>>> root_hypernyms_of_verbs = Counter(chain(*[ss.root_hypernyms() for ss in wn.all_synsets(pos='v')]))
>>> root_hypernyms_of_verbs.most_common(10)
[(Synset('change.v.01'), 1704), (Synset('change.v.02'), 1295), (Synset('act.v.01'), 1083), (Synset('move.v.02'), 1027), (Synset('make.v.03'), 659), (Synset('travel.v.01'), 526), (Synset('think.v.03'), 451), (Synset('transfer.v.05'), 420), (Synset('move.v.03'), 329), (Synset('connect.v.01'), 262)]
>>> root_hypernyms_of_verbs.keys() # This will return all root_hypernyms.
Visuwords has a very pretty interactive graph that you can use to look through the WordNet hierarchy manually, http://visuwords.com/entity

Python: Retrieving WordNet hypernyms from offset input

I know how to get hypernyms of words, like so :
word = 'girlfriend'
word_synsets = wn.synsets(word)[0]
hypernyms = word_synsets.hypernym_paths()[0]
for element in hypernyms:
    print(element)
Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('causal_agent.n.01')
Synset('person.n.01')
Synset('friend.n.01')
Synset('girlfriend.n.01')
My question is, if I wanted to search for the hypernyms of an offset, how would I change this code?
For example, given the offset 01234567-n, its hypernyms should be output, either in synset form as in my example or (preferably) in offset form. Thanks.
Here's a cute function from pywsd that's originally from http://moin.delph-in.net/SemCor
def offset_to_synset(offset):
    """
    Look up a synset given offset-pos
    (Thanks for #FBond, see http://moin.delph-in.net/SemCor)
    >>> synset = offset_to_synset('02614387-v')
    >>> print '%08d-%s' % (synset.offset, synset.pos)
    >>> print synset, synset.definition
    02614387-v
    Synset('live.v.02') lead a certain kind of life; live in a certain style
    """
    return wn._synset_from_pos_and_offset(str(offset[-1:]), int(offset[:8]))
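The string slicing can be made a little more explicit, and newer NLTK releases expose a public synset_from_pos_and_offset method, so the underscore-prefixed call is no longer necessary; a sketch (split_offset is a hypothetical helper name):

```python
def split_offset(offset_str):
    """Split an 'NNNNNNNN-p' offset string into (pos, numeric_offset)."""
    num, _, pos = offset_str.partition('-')
    return pos, int(num)

# split_offset('02614387-v') returns ('v', 2614387)

# Lookup via the public API (assumes a recent nltk with the wordnet corpus):
# from nltk.corpus import wordnet as wn
# wn.synset_from_pos_and_offset(*split_offset('02614387-v'))  # Synset('live.v.02')
```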

0th synset in NLTK wordnet interface

From the semcor corpus (http://www.cse.unt.edu/~rada/downloads.html), there are senses that weren't mapped to later versions of WordNet. And magically, the mapping can be found in the NLTK WordNet API as such:
>>> from nltk.corpus import wordnet as wn
# Enumerate the possible senses for the lemma 'delayed'
>>> wn.synsets('delayed')
[Synset('delay.v.01'), Synset('delay.v.02'), Synset('stay.v.06'), Synset('check.v.07'), Synset('delayed.s.01')]
>>> wn.synset('delay.v.01')
Synset('delay.v.01')
# Magically, there is a 0th sense of the word!!!
>>> wn.synset('delayed.a.0')
Synset('delayed.s.01')
I've checked the code and the API (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet.Synset-class.html, http://nltk.org/_modules/nltk/corpus/reader/wordnet.html) but I can't find how they did the magical mapping that shouldn't exist (e.g. delayed.a.0 -> delayed.s.01).
Does anyone know which part of the NLTK Wordnet API code does the magical mapping?
It's a bug, I guess. When you do wn.synset('delayed.a.0'), the first two lines of the method are:
lemma, pos, synset_index_str = name.lower().rsplit('.', 2)
synset_index = int(synset_index_str) - 1
So in this case the value of synset_index is -1, which is a valid index in Python. So it won't fail when looking up the list of synsets whose lemma is delayed and POS is a.
With this behavior you can do tricky things like:
>>> wn.synset('delay.v.-1')
Synset('stay.v.06')
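Until that bug is fixed, a guard against indices below 1 keeps the negative-index lookup from silently succeeding; a minimal sketch (check_synset_name is a hypothetical helper, not part of NLTK):

```python
def check_synset_name(name):
    """Validate a 'lemma.pos.NN' name; WordNet sense numbers start at 1."""
    lemma, pos, idx_str = name.rsplit('.', 2)
    if int(idx_str) < 1:
        raise ValueError("sense index must be >= 1: " + name)
    return name

# check_synset_name('delay.v.01')   -> 'delay.v.01'
# check_synset_name('delayed.a.0')  -> ValueError
# Then wn.synset(check_synset_name(name)) never hits the -1 index path.
```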

Getting adjective from an adverb in nltk or other NLP library

Is there a way to get the adjective corresponding to a given adverb in NLTK or another Python library?
For example, for the adverb "terribly", I need to get "terrible".
Thanks.
There is a relation in WordNet that connects adjectives to adverbs and vice versa.
>>> from itertools import chain
>>> from nltk.corpus import wordnet as wn
>>> from difflib import get_close_matches as gcm
>>> possible_adjectives = [k.name for k in chain(*[j.pertainyms() for j in chain(*[i.lemmas for i in wn.synsets('terribly')])])]
>>> possible_adjectives
['terrible', 'atrocious', 'awful', 'rotten']
>>> gcm('terribly',possible_adjectives)
['terrible']
A more human-readable way to compute possible_adjectives is as follows (old NLTK API, where lemmas and name are attributes):
possible_adj = []
for ss in wn.synsets('terribly'):
    for lemma in ss.lemmas: # all possible lemmas
        for pertainym in lemma.pertainyms(): # all possible pertainyms
            possible_adj.append(pertainym.name) # the pertainym's lemma name
EDIT: In the newer version of NLTK:
possible_adj = []
for ss in wn.synsets('terribly'):
    for lemmas in ss.lemmas(): # all possible lemmas
        for ps in lemmas.pertainyms(): # all possible pertainyms
            possible_adj.append(ps.name())
As MKoosej mentioned, NLTK's lemmas is no longer an attribute but a method. I also made a little simplification to get the most likely word. Hope someone else can use it too:
wordtoinv = 'unduly'
s = []
winner = ""
for ss in wn.synsets(wordtoinv):
    for lemmas in ss.lemmas(): # all possible lemmas
        s.append(lemmas)
for pers in s:
    posword = pers.pertainyms()[0].name()
    if posword[0:3] == wordtoinv[0:3]:
        winner = posword
        break
print(winner) # undue
