Using WordNet Synsets from Python for the Italian Language

I'm starting to program with NLTK in Python for natural Italian language processing. I've seen some simple examples of the WordNet library, which has a nice set of synsets that lets you navigate from a word (for example "dog") to its synonyms, its antonyms, its hyponyms and hypernyms, and so on...
My question is:
If I start with an Italian word (for example "cane", which means "dog"), is there a way to navigate between synonyms, antonyms, hyponyms... for the Italian word as you do for the English one? Or is there an equivalent of WordNet for the Italian language?
Thanks in advance

You are in luck. NLTK provides an interface to the Open Multilingual Wordnet, which does indeed include Italian among the languages it covers. Just add an argument specifying the desired language to the usual wordnet functions, e.g.:
>>> from nltk.corpus import wordnet as wn
>>> cane_lemmas = wn.lemmas("cane", lang="ita")
>>> print(cane_lemmas)
[Lemma('dog.n.01.cane'), Lemma('cramp.n.02.cane'), Lemma('hammer.n.01.cane'),
 Lemma('bad_person.n.01.cane'), Lemma('incompetent.n.01.cane')]
The synsets have English names, because they are integrated with the English wordnet. But you can navigate the web of meanings and extract the Italian lemmas for any synset you want:
>>> hypernyms = cane_lemmas[0].synset().hypernyms()
>>> print(hypernyms)
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
>>> print(hypernyms[1].lemmas(lang="ita"))
[Lemma('domestic_animal.n.01.animale_addomesticato'),
Lemma('domestic_animal.n.01.animale_domestico')]
Or since you mentioned "cattiva_persona" in the comments:
>>> wn.lemmas("bad_person")[0].synset().lemmas(lang="ita")
[Lemma('bad_person.n.01.cane'), Lemma('bad_person.n.01.cattivo')]
I went from the English lemma to the language-independent synset to the Italian lemmas.
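The same navigation can be bundled into a small helper. This is just a sketch; italian_relations is an illustrative name, not part of NLTK:
from nltk.corpus import wordnet as wn

def italian_relations(word):
    # For each sense of the Italian word, collect its Italian synonyms
    # and the Italian lemmas of its hypernyms.
    relations = {}
    for synset in wn.synsets(word, lang="ita"):
        relations[synset.name()] = {
            "synonyms": synset.lemma_names(lang="ita"),
            "hypernyms": [name for hyper in synset.hypernyms()
                          for name in hyper.lemma_names(lang="ita")],
        }
    return relations

print(italian_relations("cane"))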

Since I found myself wondering how to actually use the wordnet resources after reading this question and its answer, I'll leave some useful information here:
Here is a link to the nltk guide.
The two commands needed to download the wordnet data, so you can proceed with the usage explained in the other answer, are:
import nltk
nltk.download('wordnet')
nltk.download('omw')
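On recent NLTK releases the multilingual lemma data is distributed as the omw-1.4 package, so you may additionally need:
nltk.download('omw-1.4')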


Why are some words that should be in the NLTK corpus missing? [duplicate]

The NLTK words corpus does not contain "okay", "ok", or "Okay"?
>>> from nltk.corpus import words
>>> words.words().__contains__("check")
True
>>> words.words().__contains__("okay")
False
>>> len(words.words())
236736
Any ideas why?
TL;DR
from nltk.corpus import words
from nltk.corpus import wordnet
# wordnet.words() may return an iterator of lemma names, so materialize it before concatenating
manywords = words.words() + list(wordnet.words())
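If the combined vocabulary is only used for membership tests, a set is much faster to query than a list; a small sketch building on the same imports:
vocab = set(words.words()) | set(wordnet.words())
print("okay" in vocab)  # "okay" is among WordNet's lemma names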
In Long
From the docs, nltk.corpus.words is a list of words from http://en.wikipedia.org/wiki/Words_(Unix).
Which in Unix, you can do:
ls /usr/share/dict/
And reading the README:
$ cd /usr/share/dict/
/usr/share/dict$ cat README
# @(#)README 8.1 (Berkeley) 6/5/93
# $FreeBSD$
WEB ---- (introduction provided by jaw@riacs) -------------------------
Welcome to web2 (Webster's Second International) all 234,936 words worth.
The 1934 copyright has lapsed, according to the supplier. The
supplemental 'web2a' list contains hyphenated terms as well as assorted
noun and adverbial phrases. The wordlist makes a dandy 'grep' victim.
-- James A. Woods {ihnp4,hplabs}!ames!jaw (or jaw@riacs)
Country names are stored in the file /usr/share/misc/iso3166.
FreeBSD Maintenance Notes ---------------------------------------------
Note that FreeBSD is not maintaining a historical document, we're
maintaining a list of current [American] English spellings.
A few words have been removed because their spellings have depreciated.
This list of words includes:
    corelation (and its derivatives)    "correlation" is the preferred spelling
    freen                               typographical error in original file
    freend                              archaic spelling no longer in use;
                                        masks common typo in modern text
--
A list of technical terms has been added in the file 'freebsd'. This
word list contains FreeBSD/Unix lexicon that is used by the system
documentation. It makes a great ispell(1) personal dictionary to
supplement the standard English language dictionary.
Since it's a fixed list of 234,936 words, there are bound to be words that don't exist in that list.
If you need to extend your word list, you can add to the list using the words from WordNet using nltk.corpus.wordnet.words().
Most probably, all you need is a large enough corpus of text, e.g. Wikipedia dump and then tokenize it and extract all unique words.
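As a minimal sketch of that idea, here is one way to build a vocabulary from the Brown corpus that ships with NLTK (assumes nltk.download('brown') has been run; any large text collection would do):
from nltk.corpus import brown

# Collect every distinct alphabetic token, lowercased, as the vocabulary.
vocabulary = {token.lower() for token in brown.words() if token.isalpha()}
print(len(vocabulary))
print("okay" in vocabulary)  # depends on the corpus you choose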
I am unable to comment due to low reputation, but I can offer a couple of things.
I've posted a zip file in the nltk_data issue related to this, which contains a more comprehensive set of words merged in from the Ubuntu 18.04 /usr/share/dict/american-english list.
The original /usr/share/dict files have some glaring omissions, such as 'failed' and 'failings'. Unfortunately, using wordnet doesn't really resolve this: it adds 'fail-safe' and several types of failure such as 'equipment_failure' and 'renal_failure', but it doesn't add the basic words. Hopefully the supplied zip file will be of some use.

nltk "OMW" wordnet with Arabic language

I'm working with Python/NLTK and the (OMW) wordnet, specifically for the Arabic language. All the functions work fine with the English language, yet I can't seem to be able to perform any of them when I use the 'arb' tag. The only thing that works great is extracting the lemma names from a given Arabic synset.
The code below works fine with u'arb':
The output is a list of Arabic lemmas.
for synset in wn.synsets(u'عام', lang='arb'):
    for lemma in synset.lemma_names(u'arb'):
        print lemma
When I try to perform the same logic as in the code above with synset definitions, examples, or hypernyms, I get an error which says:
TypeError: hyponyms() takes exactly 1 argument (2 given)
(if I supply the 'arb' flag) or
KeyError: u'arb'
Here is one example that will not work if I write synset.hyponyms(u'arb'):
for synset in wn.synsets(u'عام', lang='arb'):
    for hypo in synset.hyponyms():  # prints the hyponyms in English, not Arabic
        print hypo
Does this mean that I can't get to use wn.all_synsets and other built-in functions to extract all the Arabic synsets, hypernyms, etc?
The nltk's Open Multilingual Wordnet has English names for all the synsets, since it is a multilingual database centered on the original English Wordnet. Synsets model meanings, hence they are language-independent and cannot be requested in a specific language. But each synset is linked to lemmas for the languages covered by the OMW. Once you have some synsets (original, hyponyms, etc.), just ask for the Arabic lemmas again:
>>> for synset in wn.synsets(u'عام', lang='arb'):
...     for hypo in synset.hyponyms():
...         for lemma in hypo.lemmas("arb"):
...             print(lemma)
...
Lemma('waft.v.01.إِنْبعث')
Lemma('waft.v.01.انبعث')
Lemma('waft.v.01.إنبعث_كالرائحة_العطرة')
Lemma('waft.v.01.إِنْدفع')
Lemma('waft.v.01.إِنْطلق')
Lemma('waft.v.01.انطلق')
Lemma('waft.v.01.حمل_بخفة')
Lemma('waft.v.01.دفع')
Lemma('calendar_year.n.01.سنة_شمْسِيّة')
Lemma('calendar_year.n.01.سنة_مدنِيّة')
Lemma('fiscal_year.n.01.سنة_ضرِيبِيّة')
Lemma('fiscal_year.n.01.سنة_مالِيّة')
In other words, the lemmas are multilingual, the synsets are not.
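The same round trip can be wrapped in a small helper (just a sketch; hyponym_lemmas is an illustrative name, not an NLTK function):
from nltk.corpus import wordnet as wn

def hyponym_lemmas(word, lang="arb"):
    # Look up the word's synsets, walk to their hyponyms,
    # then come back to lemmas in the requested language.
    lemmas = []
    for synset in wn.synsets(word, lang=lang):
        for hypo in synset.hyponyms():
            lemmas.extend(hypo.lemma_names(lang=lang))
    return lemmas

print(hyponym_lemmas(u'عام'))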

Using Arabic WordNet for synonyms in python?

I am trying to get the synonyms for Arabic words in a sentence.
If the word is in English it works perfectly, and the results are displayed in Arabic. I was wondering if it's possible to get the synonyms of an Arabic word right away, without writing it in English first.
I tried the code below, but it didn't work. I would also prefer to work without tashkeel: انتظار instead of اِنْتِظار.
from nltk.corpus import wordnet as omw
jan = omw.synsets('انتظار ')[0]
print(jan)
print(jan.lemma_names(lang='arb'))
The WordNet used in NLTK doesn't support Arabic. If you are looking for the Arabic WordNet (AWN), that is a totally different thing.
For Arabic wordnet, download:
http://nlp.lsi.upc.edu/awn/get_bd.php
http://nlp.lsi.upc.edu/awn/AWNDatabaseManagement.py.gz
You run it with:
$ python AWNDatabaseManagement.py -i upc_db.xml
Now, to get something like wn.synset('إنتظار'): the Arabic WordNet has a function wn.get_synsets_from_word(word), but it returns offsets. It also accepts words only as they are vocalized in the database. For example, you should use جَمِيل for جميل:
>> wn.get_synsets_from_word(u"جَمِيل")
[(u'a', u'300218842')]
300218842 is the offset of the synset of جميل.
I checked for the word إنتظار and it seems it doesn't exist in AWN.
More details about using AWN to get synonyms here.

NLTK words lemmatizing

I am trying to do lemmatization on words with NLTK.
What I have found so far is that I can use the stem package to get some results, like transforming "cars" to "car" and "women" to "woman"; however, I cannot lemmatize words with affixes like "acknowledgement".
When using WordNetLemmatizer() on "acknowledgement", it returns "acknowledgement", and using PorterStemmer() it returns "acknowledg" rather than "acknowledge".
Can anyone tell me how to eliminate the affixes of words?
Say, when the input is "acknowledgement", the output should be "acknowledge".
Lemmatization does not (and should not) return "acknowledge" for "acknowledgement". The former is a verb, while the latter is a noun. Porter's stemming algorithm, on the other hand, simply uses a fixed set of rules. So, your only way there is to change the rules at source. (NOT the right way to fix your problem).
What you are looking for is the derivationally related form of "acknowledgement", and for this, your best source is WordNet. You can check this online on WordNet.
There are quite a few WordNet-based libraries that you can use for this (e.g. JWNL in Java). In Python, NLTK should be able to get the derivationally related form you saw online:
from nltk.corpus import wordnet as wn
acknowledgment_synset = wn.synset('acknowledgement.n.01')
acknowledgment_lemma = acknowledgment_synset.lemmas()[1]  # lemmas() is a method in NLTK 3
print(acknowledgment_lemma.derivationally_related_forms())
# [Lemma('admit.v.01.acknowledge'), Lemma('acknowledge.v.06.acknowledge')]
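If you only want the verb forms from that list, you can filter by part of speech (a small sketch building on the snippet above):
verb_forms = [lemma.name() for lemma in acknowledgment_lemma.derivationally_related_forms()
              if lemma.synset().pos() == 'v']
print(verb_forms)  # e.g. ['acknowledge', 'acknowledge']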

Using NLTK and WordNet; how do I convert simple tense verb into its present, past or past participle form?

Using NLTK and WordNet, how do I convert simple tense verb into its present, past or past participle form?
For example:
I want to write a function which would give me the verb in the expected form, as follows.
v = 'go'
present = present_tense(v)
print present # prints "going"
past = past_tense(v)
print past # prints "went"
This can also be done with the help of NLTK, although it gives only the base form of the verb, not the exact tense; it can still be useful. Try the following code.
from nltk.stem.wordnet import WordNetLemmatizer
words = ['gave', 'went', 'going', 'dating']
for word in words:
    print word + "-->" + WordNetLemmatizer().lemmatize(word, 'v')
The output is:
gave-->give
went-->go
going-->go
dating-->date
Have a look at Stack Overflow question NLTK WordNet Lemmatizer: Shouldn't it lemmatize all inflections of a word?.
I think what you're looking for is the NodeBox::Linguistics library. It does exactly that:
print en.verb.present("gave")
>>> give
For Python3:
pip install pattern
then
from pattern.en import conjugate, lemma, lexeme, PRESENT, SG
print (lemma('gave'))
print (lexeme('gave'))
print (conjugate(verb='give',tense=PRESENT,number=SG)) # he / she / it
yields
give
['give', 'gives', 'giving', 'gave', 'given']
gives
Thanks to @Agargara for pointing this out, and to the authors of Pattern for their beautiful work, go support them ;-)
PS. To use most of pattern's functionality in Python 3.7+, you might want to use the trick described here.
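Building on pattern, here is a minimal sketch of the two helpers the question asks for (past_tense and present_participle are illustrative names; note that "going" is actually the present participle rather than the simple present):
from pattern.en import conjugate, PAST, PARTICIPLE

def past_tense(verb):
    return conjugate(verb, tense=PAST)        # 'go' -> 'went'

def present_participle(verb):
    return conjugate(verb, tense=PARTICIPLE)  # 'go' -> 'going'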
JWI (the WordNet library by MIT) also has a stemmer (WordNetStemmer) which converts different morphological forms of a word ("written", "writes", "wrote") to their base form. It seems it works only for nouns (like plurals) and verbs, though.
Word Stemming in Java with WordNet and JWNL also shows how to do this kind of stemming using JWNL, another Java-based WordNet library.
