A function in my program finds the definitions of certain vocabulary words, for use by other parts of my program. However, it seems that not every vocabulary word is present in WordNet.
I find the definitions as follows:
y = wn.synset(w + '.n.01').definition()
where 'w' is one of the many vocabulary words being fed from a list (I did not include the rest of the program because it is mostly irrelevant code). When the list reaches the term 'ligase', however, the following error comes up:
  line 1298, in synset
    raise WordNetError(message % (lemma, pos))
nltk.corpus.reader.wordnet.WordNetError: no lemma 'ligase' with part of speech 'n'
Is there any way to bypass this, or a different way to find the definitions of terms not in WordNet? My program is going through various scientific terms, so this may occur more often as I add more words to the list.
You should not assume that a word is known to WordNet. Check whether there are any relevant synsets, and ask for a definition only if there is at least one:
for word in ["ligase", "world"]: # Your word list
ss = wn.synsets(word)
if ss:
definition = ss[0].definition()
print("{}: {}".format(word, definition))
else:
print("### Word {} not found".format(word))
#### Word ligase not found
#world: everything that exists anywhere
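If you specifically want the noun reading, as the original '.n.01' lookup asked for, wn.synsets also accepts a pos argument, so a small variation on the check above (still hedged on the word having any noun sense at all) would be:

ss = wn.synsets(word, pos=wn.NOUN)  # only noun synsets, like '.n.01' did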
Related
There are several words that use "-ing" for the present continuous, like "shining", but when I try to lemmatize "shining" using NLTK, it changes into "shin". The code is this:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
word = "shining"
newlemma = wordnet_lemmatizer.lemmatize(word,'v')
print newlemma
Even without using 'v', it stays the same "shining" and doesn't change.
I'm expecting the output "shine".
Can anybody help? Thanks.
This happens because of the way WordNet applies rules and exception lists when searching for the root form.
It has a list of rules specifically for removing word endings, for instance:
"ing" -> ""
"ing" -> "e"
It applies the rules and checks whether the resulting word form exists in WordNet. For instance, with mining it would try min and find nothing; then it would try mine (second rule), find that mine is a valid word, and return it. But with shining, it likely tries shin first, finds shin in the list of valid words, believes this to be the proper root, and returns it.
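You can reproduce the contrast those rules create with a couple of calls (a minimal sketch, assuming NLTK with the WordNet corpus downloaded):

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# "min" is not a valid verb, so the "ing" -> "e" rule wins here...
print(wnl.lemmatize('mining', 'v'))   # mine
# ...but "shin" is a valid verb (to shin up a tree), so it is returned instead
print(wnl.lemmatize('shining', 'v'))  # shin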
I am a beginner in computer programming and I am completing an essay on Parallel Corpora in Word Sense Disambiguation.
Basically, I intend to show that substituting a sense for a word translation simplifies the process of identifying the meaning of ambiguous words. I have already word-aligned my parallel corpus (EUROPARL English-Spanish) with GIZA++, but I don't know what to do with the output files. My intention is to build a classifier to calculate the probability of a translation word given the contextual features of the tokens which surround the ambiguous word in the source text.
So, my question is: how do you extract instances of an ambiguous word from a parallel corpus WITH its aligned translation?
I have tried various scripts in Python, but these run on the assumption that 1) the English and Spanish texts are in separate corpora and 2) the English and Spanish sentences share the same indexes, which obviously does not work.
e.g.
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

w_lemma = WordNetLemmatizer()

def ambigu_word2(document, document2):
    words = ['letter']
    for sentences in document:
        tokens = word_tokenize(sentences)
        for item in tokens:
            x = w_lemma.lemmatize(item)
            for w in words:
                # "w == x in sentences" chains two comparisons; an explicit
                # conjunction is what was intended here
                if w == x and x in sentences:
                    print(sentences, document2[document.index(sentences)])

print(ambigu_word2(raw1, raw2))
I would be really grateful if you could provide any guidance on this matter.
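GIZA++'s A3.final files already pair each source token with the target positions it is aligned to, so one way in is to pull every occurrence of the ambiguous word, together with its aligned translation, straight from that file. Below is a rough sketch, not a definitive implementation: it assumes the usual three-lines-per-pair A3 layout (a "# Sentence pair ..." comment line, the target sentence, then the source tokens annotated like "letter ({ 4 })"), and aligned_instances and a3_path are names I made up:

import re

def aligned_instances(a3_path, ambiguous_word):
    """Collect (source_line, aligned_target_tokens) pairs for one word."""
    instances = []
    with open(a3_path) as f:
        lines = f.read().splitlines()
    # each sentence pair occupies three lines in A3 output
    for i in range(0, len(lines) - 2, 3):
        target_tokens = lines[i + 1].split()
        source_line = lines[i + 2]
        # every source token is followed by the 1-based target positions it
        # is aligned to, e.g. "letter ({ 4 5 })"; NULL collects the unaligned
        for token, positions in re.findall(r'(\S+) \(\{([\d ]*)\}\)', source_line):
            if token.lower() == ambiguous_word:
                translation = [target_tokens[int(p) - 1] for p in positions.split()]
                instances.append((source_line, translation))
    return instances

Each (sentence, translation) pair this yields is one training instance for the classifier: the aligned Spanish token stands in for the sense label, and the tokens around the English word supply the contextual features.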
This code loops through every word in words.words() from the nltk library, then pushes the word into an array. Then it checks every word in the array to see if it is an actual word by using the same library, and somehow many of them come back as strange words that aren't real at all, like "adighe". What's going on here?
import nltk
from nltk.corpus import words

test_array = []
for i in words.words():
    i = i.lower()
    test_array.append(i)

for i in test_array:
    if i not in words.words():
        print(i)
I don't think there's anything mysterious going on here. The first such example I found is "Aani", "the dog-headed ape sacred to the Egyptian god Thoth". Since it's a proper noun, "Aani" is in the word list and "aani" isn't.
According to dictionary.com, "Adighe" is an alternative spelling of "Adygei", which is another proper noun meaning a region of Russia. Since it's also a language I suppose you might argue that "adighe" should also be allowed. This particular word list will argue that it shouldn't.
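In other words, the membership test is case-sensitive while your loop lowercased everything first. A minimal check makes the mismatch visible (building a set also avoids rescanning the whole list on every lookup, which is why your second loop is so slow):

from nltk.corpus import words

wordlist = set(words.words())  # a set makes each lookup O(1)
print('Aani' in wordlist)      # True: listed, capitalized, as a proper noun
print('aani' in wordlist)      # False: the lowercased form is not in the list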
I am working on a polysemy disambiguation project, and for that I am trying to find the polysemous words in an input query. The way I am doing it is:
#! /usr/bin/python
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

stop = stopwords.words('english')
print "enter input query"
string = raw_input()
str1 = [i for i in string.split() if i not in stop]
a = list()
for w in str1:
    if len(wn.synsets(w)) > 1:
        a.append(w)
Here the list a will contain the polysemous words.
But with this method almost all words are considered polysemous.
E.g. if my input query is "milk is white in colour", then it stores ('milk', 'white', 'colour') as polysemous words.
WordNet is known to be very fine-grained, and it sometimes makes distinctions between very subtly different senses that you and I might think are the same. There have been attempts to make WordNet coarser; search for work on the automatic construction of a coarse-grained WordNet. I am not sure whether the results of that work are available for download, but you can always contact the authors.
Alternatively, change your working definition of polysemy: if the most frequent sense of a word accounts for more than 80% of its uses in a large corpus, then the word is not polysemous. You will have to obtain frequency counts for the different senses of as many words as possible; a sense-tagged corpus such as SemCor is the usual source for such counts.
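A minimal sketch of that coarser test, assuming a current NLTK (where synsets/lemmas/count are methods) and reusing the sense frequency counts WordNet itself ships, which come from the SemCor sense-tagged corpus; the 80% threshold and the name is_polysemous are placeholders:

from nltk.corpus import wordnet as wn

def is_polysemous(word, threshold=0.8):
    # frequency of each sense of `word`, as tagged in SemCor
    counts = [lemma.count()
              for synset in wn.synsets(word)
              for lemma in synset.lemmas()
              if lemma.name().lower() == word.lower()]
    total = sum(counts)
    if len(counts) < 2 or total == 0:
        return False  # monosemous, unknown, or no usable counts
    # polysemous only if no single sense dominates the counts
    return max(counts) / float(total) < threshold

print(is_polysemous('milk'))  # likely False: one sense dominates
print(is_polysemous('bank'))  # likely True: counts spread across senses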
I'm trying to create a general synonym identifier for the significant words in a sentence (i.e. not "a" or "the"), and I am using the Natural Language Toolkit (NLTK) in Python for it. The problem I am having is that the synonym finder in NLTK requires a part-of-speech argument in order to be linked to its synonyms. My attempted fix was to use the simplified part-of-speech tagger in NLTK and then take the first letter of the tag to pass as this argument into the synonym finder; however, this is not working.
def synonyms(Sentence):
    Keywords = []
    Equivalence = WordNetLemmatizer()
    Stemmer = stem.SnowballStemmer('english')
    for word in Sentence:
        word = Equivalence.lemmatize(word)
    words = nltk.word_tokenize(Sentence.lower())
    text = nltk.Text(words)
    tags = nltk.pos_tag(text)
    simplified_tags = [(word, simplify_wsj_tag(tag)) for word, tag in tags]
    for tag in simplified_tags:
        print tag
        grammar_letter = tag[1][0].lower()
        if grammar_letter != 'd':
            Call = tag[0].strip() + "." + grammar_letter.strip() + ".01"
            print Call
            Word_Set = wordnet.synset(Call)
            paths = Word_Set.lemma_names
            for path in paths:
                Keywords.append(Stemmer.stem(path))
    return Keywords
This is the code I am currently working from, and as you can see I am first lemmatizing the input to reduce the number of matches I will have in the long run (I plan on running this on tens of thousands of sentences). In theory I would stem the word after this to further that effect and reduce the number of redundant words I generate; however, this method almost invariably returns errors in the form of the one below:
Traceback (most recent call last):
  File "C:\Python27\test.py", line 45, in <module>
    synonyms('spray reddish attack force')
  File "C:\Python27\test.py", line 39, in synonyms
    Word_Set = wordnet.synset(Call)
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1016, in synset
    raise WordNetError(message % (lemma, pos))
WordNetError: no lemma 'reddish' with part of speech 'n'
I don't have much control over the data this will be running on, so simply cleaning my corpus is not really an option. Any ideas on how to solve this one?
I did some more research and found a promising lead, but I'm still not sure how to implement it. In the case of a word that is not found or incorrectly assigned, I would like to use a similarity metric (Leacock-Chodorow, Wu-Palmer, etc.) to link the word to the closest correctly categorized other keyword, perhaps in conjunction with an edit distance measure, but again I haven't been able to find any documentation on this.
Apparently NLTK allows for the retrieval of all synsets associated with a word. Granted, there are usually a number of them, reflecting different word senses. In order to functionally find synonyms (or to decide whether two words are synonyms), you must attempt to match the closest possible synonym sets, which is possible through any of the similarity metrics mentioned above. I crafted some basic code to do this, shown below; it checks whether two words are synonyms:
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
import itertools

def Synonym_Checker(word1, word2):
    """Checks if word1 and word2 are synonyms. Returns True if they are, otherwise False"""
    equivalence = WordNetLemmatizer()
    word1 = equivalence.lemmatize(word1)
    word2 = equivalence.lemmatize(word2)

    word1_synonyms = wordnet.synsets(word1)
    word2_synonyms = wordnet.synsets(word2)

    # Score every pairing of senses; wup_similarity can return None when
    # two synsets share no common ancestor, so treat that as 0.
    scores = [i.wup_similarity(j) or 0
              for i, j in itertools.product(word1_synonyms, word2_synonyms)]
    max_index = scores.index(max(scores))
    # product() varies the second list fastest, so unflatten the index
    # with len(word2_synonyms), not len(word1_synonyms).
    best_match = (max_index // len(word2_synonyms),
                  max_index % len(word2_synonyms))

    word1_set = word1_synonyms[best_match[0]].lemma_names
    word2_set = word2_synonyms[best_match[1]].lemma_names
    # Synonyms if any lemma of word1's best sense also names word2's.
    match = any(word in word2_set for word in word1_set)
    return match

print Synonym_Checker("tomato", "Lycopersicon_esculentum")
I may try to implement progressively stronger stemming algorithms, but for the first few tests I did, this code actually worked for every word I could find. If anyone has ideas on how to improve this algorithm, or anything that would improve this answer in any way, I would love to hear it.
Can you wrap your Word_Set = wordnet.synset(Call) in a try: and ignore the WordNetError exception? It looks like the error you have is that some words are not categorized correctly, but this exception would also occur for unrecognized words, so catching it just seems like a good idea to me.
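A minimal sketch of that suggestion (safe_synset is a name I made up; returning None lets the caller decide to skip the word):

from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import WordNetError

def safe_synset(call):
    """Return the synset for a 'word.pos.nn' key, or None if WordNet lacks it."""
    try:
        return wordnet.synset(call)
    except WordNetError:
        return None

print(safe_synset('reddish.n.01'))  # None: no noun lemma 'reddish'
print(safe_synset('world.n.01'))    # a real Synset, ready for .lemma_names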