How to find synonyms using the relations in WordNet - Python

I am new to NLP, NLTK, and Python. I am using WordNet to get the synonyms for a word in a given sentence. I am using the code below to get the synonyms and the lemma names of those words:
from itertools import chain
from nltk.corpus import wordnet

synonyms = wordnet.synsets(w, pos)
lemmas.append(list(set(chain.from_iterable([w.lemma_names() for w in synonyms]))))
e.g. wordnet.synsets('get', 'v')
The lemma_names for the word "get" returns many things that are irrelevant to me.
My search string is "error getting the report". lemma_names even includes "buzz off" and "gets under one's skin", which are not correct for my statement.
So is there a way to get synonyms that are relevant to the statement? Is there a concept or algorithm that I can look into?
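One standard approach is word sense disambiguation: first pick the synset that best fits the sentence, then take only that synset's lemma names. Below is a minimal sketch using NLTK's built-in simplified Lesk implementation (assuming the WordNet data is downloaded; Lesk is a rough heuristic, so results vary):

from nltk.wsd import lesk

sentence = "error getting the report".split()

# lesk() scores each candidate synset by the overlap between its
# definition and the context words, returning the best Synset (or None)
best = lesk(sentence, 'get', pos='v')
if best is not None:
    print(best.name(), '-', best.definition())
    print(best.lemma_names())  # synonyms for this sense only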

Related

nltk "OMW" wordnet with Arabic language

I'm working in Python/NLTK with the Open Multilingual Wordnet (OMW), specifically for the Arabic language. All the functions work fine for English, yet I can't seem to perform any of them when I use the 'arb' tag. The only thing that works is extracting the lemma_names from a given Arabic synset.
The code below works fine with u'arb'; the output is a list of Arabic lemmas:
from nltk.corpus import wordnet as wn

for synset in wn.synsets(u'عام', lang='arb'):
    for lemma in synset.lemma_names(u'arb'):
        print(lemma)
When I try to perform the same logic with definitions, examples, or hypernyms, I get an error which says:
TypeError: hyponyms() takes exactly 1 argument (2 given)
(if I supply the 'arb' flag) or
KeyError: u'arb'
Here is one snippet that fails if I write synset.hyponyms(u'arb'):
for synset in wn.synsets(u'عام', lang='arb'):
    for hypo in synset.hyponyms():  # prints the hyponyms in English, not Arabic
        print(hypo)
Does this mean that I can't use wn.all_synsets and the other built-in functions to extract all the Arabic synsets, hypernyms, etc.?
The nltk's Open Multilingual Wordnet has English names for all the synsets, since it is a multilingual database centered on the original English Wordnet. Synsets model meanings, hence they are language-independent and cannot be requested in a specific language. But each synset is linked to lemmas for the languages covered by the OMW. Once you have some synsets (original, hyponyms, etc.), just ask for the Arabic lemmas again:
>>> for synset in wn.synsets(u'عام', lang='arb'):
...     for hypo in synset.hyponyms():
...         for lemma in hypo.lemmas('arb'):
...             print(lemma)
...
Lemma('waft.v.01.إِنْبعث')
Lemma('waft.v.01.انبعث')
Lemma('waft.v.01.إنبعث_كالرائحة_العطرة')
Lemma('waft.v.01.إِنْدفع')
Lemma('waft.v.01.إِنْطلق')
Lemma('waft.v.01.انطلق')
Lemma('waft.v.01.حمل_بخفة')
Lemma('waft.v.01.دفع')
Lemma('calendar_year.n.01.سنة_شمْسِيّة')
Lemma('calendar_year.n.01.سنة_مدنِيّة')
Lemma('fiscal_year.n.01.سنة_ضرِيبِيّة')
Lemma('fiscal_year.n.01.سنة_مالِيّة')
In other words, the lemmas are multilingual, the synsets are not.
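The same pattern extends to wn.all_synsets(): iterate over the language-neutral synsets and request the Arabic lemmas per synset. A small sketch, not from the original answer, assuming the OMW data is installed:

from nltk.corpus import wordnet as wn

for synset in wn.all_synsets('n'):
    arabic = synset.lemma_names('arb')
    if arabic:  # many synsets have no Arabic lemmas in the OMW
        print(synset.name(), arabic)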

words.words() from nltk corpus seemingly contains strange non-valid words

This code loops through every word in words.words() from the nltk library and pushes each word into an array. Then it checks every word in the array to see if it is an actual word, using the same corpus, and somehow many of them come back as not being real words at all, like "adighe". What's going on here?
import nltk
from nltk.corpus import words

test_array = []
for i in words.words():
    i = i.lower()
    test_array.append(i)

for i in test_array:
    if i not in words.words():
        print(i)
I don't think there's anything mysterious going on here. The first such example I found is "Aani", "the dog-headed ape sacred to the Egyptian god Thoth". Since it's a proper noun, "Aani" is in the word list and "aani" isn't.
According to dictionary.com, "Adighe" is an alternative spelling of "Adygei", which is another proper noun meaning a region of Russia. Since it's also a language I suppose you might argue that "adighe" should also be allowed. This particular word list will argue that it shouldn't.
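If the goal is a case-insensitive membership test, one option (a sketch, not part of the original answer) is to lowercase the word list once into a set, which also makes each lookup O(1) instead of a linear scan over the corpus:

from nltk.corpus import words

# build a lowercased set once; lowercasing both sides makes the
# check case-insensitive, so proper nouns like "Aani" match too
vocab = {w.lower() for w in words.words()}

print('aani' in vocab)    # True, matches the proper noun "Aani"
print('adighe' in vocab)  # True, matches "Adighe"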

NLTK words lemmatizing

I am trying to do lemmatization on words with NLTK.
What I have found so far is that I can use the stem package to get some results, like transforming "cars" to "car" and "women" to "woman"; however, I cannot lemmatize words with affixes like "acknowledgement".
When using WordNetLemmatizer() on "acknowledgement", it returns "acknowledgement", and PorterStemmer() returns "acknowledg" rather than "acknowledge".
Can anyone tell me how to strip the affixes of words? Say, when the input is "acknowledgement", the output should be "acknowledge".
Lemmatization does not (and should not) return "acknowledge" for "acknowledgement". The former is a verb, while the latter is a noun. Porter's stemming algorithm, on the other hand, simply applies a fixed set of rules, so your only way there is to change the rules at the source (NOT the right way to fix your problem).
What you are looking for is the derivationally related form of "acknowledgement", and for this, your best source is WordNet. You can check this in the online WordNet browser.
There are quite a few WordNet-based libraries you can use for this (e.g. JWNL in Java). In Python, NLTK can get the derivationally related form you saw online:
from nltk.corpus import wordnet as wn

acknowledgment_synset = wn.synset('acknowledgement.n.01')
# lemmas() is a method in current NLTK (it was a property in older versions);
# index 1 picks the second lemma in this synset
acknowledgment_lemma = acknowledgment_synset.lemmas()[1]
print(acknowledgment_lemma.derivationally_related_forms())
# [Lemma('admit.v.01.acknowledge'), Lemma('acknowledge.v.06.acknowledge')]

Finding synonyms for a certain word creates a WordNetError

I am attempting to get synonyms for a word using the Python library NLTK.
My problem: Some words raise an error when I use them. For example, 'eat' throws a WordNetError saying "no lemma 'eat' with part of speech 'n'". What does that mean? How can I retrieve synonyms for the word 'eat'?
Here's my code; note how words like 'dog' do work:
from nltk.corpus import wordnet as wn
print(wn.synset('dog.n.01').lemma_names())
print(wn.synset('eat.n.01').lemma_names())  # raises WordNetError
Also, is it possible to get synonyms for a group of words? For example, for 'main course', can I get the synonyms 'main dish', 'main meal', and 'dinner'?
The error says no lemma 'eat' with part of speech 'n'. That means that "eat" isn't in WordNet as a noun. Try it as a verb:
>>> wn.synset('eat.v.01').lemma_names()
['eat']
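As for the second question: WordNet does contain multiword collocations, which are stored with underscores. A small sketch (not part of the original answer):

from nltk.corpus import wordnet as wn

# "main course" is looked up as 'main_course'
for synset in wn.synsets('main_course'):
    print(synset.name(), synset.lemma_names())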

Performing stemming outputs gibberish/concatenated words

I am experimenting with the Python library NLTK for Natural Language Processing.
My problem: I'm trying to perform stemming, i.e. reduce words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I am after?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem

# the original snippet never defines 'stemmer'; the Porter stemmer
# matches the output shown below
stemmer = stem.PorterStemmer()

words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
for word in words:
    print(stemmer.stem(word))

# output is:
# forgot forgotten there' myself remuner
There are two types of normalization you can do at the word level:
- Stemming: a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token.
- Lemmatization: converting a word into some base form (singular, present tense, etc.) which is always a legitimate word on its own. This can obviously be slower and more complicated, and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
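A minimal sketch of that approach (assuming the WordNet data is downloaded):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# lemmatize() assumes the noun POS by default, so pass pos='v'
# for verbs to get the base form
print(lemmatizer.lemmatize('women'))               # woman
print(lemmatizer.lemmatize('forgotten', pos='v'))  # forget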
