comparing synonyms NLTK [duplicate] - python

This question already has answers here:
All synonyms for word in python? [duplicate]
(3 answers)
Closed 7 years ago.
I've come up with a rather strange problem; I hope you can help me.
from nltk.corpus import wordnet as wn

for p in wn.synsets('change'):
    print(p)
Getting:
Synset('change.n.01')
Synset('change.n.02')
Synset('change.n.03')
Synset('change.n.04')
Synset('change.n.05')
Synset('change.n.06')
Synset('change.n.07')
Synset('change.n.08')
Synset('change.n.09')
Synset('variety.n.06')
Synset('change.v.01')
Synset('change.v.02')
Synset('change.v.03')
Synset('switch.v.03')
Synset('change.v.05')
Synset('change.v.06')
Synset('exchange.v.01')
Synset('transfer.v.06')
Synset('deepen.v.04')
Synset('change.v.10')
For example, I have a string:
a = 'transfer'
I'd like to be able to identify all synonyms of the word 'change' and know whether, for example, 'transfer' is one of them. How can I ask my program:
"Is 'transfer' one of the synonyms of 'change'?"

Firstly, WordNet indexes concepts (aka Synsets) and links the possible words for each concept; the following code shows the concepts linked to the word 'change':
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('change')
[Synset('change.n.01'), Synset('change.n.02'), Synset('change.n.03'), Synset('change.n.04'), Synset('change.n.05'), Synset('change.n.06'), Synset('change.n.07'), Synset('change.n.08'), Synset('change.n.09'), Synset('variety.n.06'), Synset('change.v.01'), Synset('change.v.02'), Synset('change.v.03'), Synset('switch.v.03'), Synset('change.v.05'), Synset('change.v.06'), Synset('exchange.v.01'), Synset('transfer.v.06'), Synset('deepen.v.04'), Synset('change.v.10')]
A synset has several properties:
ID number
Part-of-Speech label
definition
lemma names, i.e. the possible words that can be used to instantiate the concept
links to other synsets via various -nymy relations (e.g. hypernym, hyponym, meronym)
Here's how to interface the above properties in NLTK:
>>> wn.synsets('change')[0]
Synset('change.n.01')
>>> wn.synsets('change')[0].offset()
7296428
>>> wn.synsets('change')[0].pos()
u'n'
>>> wn.synsets('change')[0].definition()
u'an event that occurs when something passes from one state or phase to another'
>>> wn.synsets('change')[0].lemma_names()
[u'change', u'alteration', u'modification']
>>> wn.synsets('change')[0].hypernyms()
[Synset('happening.n.01')]
But a synset doesn't necessarily have synonym relations. If we define synonyms as words that have similar meanings, it is the words (i.e. lemmas) that have synonymy relations. In addition, the context in which the words appear determines whether one word is a synonym of another. A single word has limited meaning on its own; it's the "concept" that carries the meaning and is instantiated through human words. At least that's the typical theory of semantics, see chapter 2 in http://goo.gl/ZHzlNF
So when you want to ask whether 'transfer' is a synonym of 'change', you first have to:
define/select the concept you're referring to and provide the context in which 'transfer' is used (google "Word Sense Disambiguation"), and
define which concept of 'change' you are referring to.
Then a comparison of meaning is possible (see the sketch below).
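To make that concrete, here is a minimal, hedged sketch of a sense-aware comparison. It assumes that NLTK's simplified Lesk implementation (nltk.wsd.lesk) is good enough to pick a sense of 'transfer' from a context sentence, and that the sense of 'change' is chosen by hand; the example sentence is made up.
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

# made-up context for the ambiguous word 'transfer'
context = "I had to transfer to another train at the next station".split()
transfer_sense = lesk(context, 'transfer', 'v')   # pick a verb sense of 'transfer' in this context
change_sense = wn.synset('change.v.02')           # the sense of 'change' we chose to compare against

# identical synsets mean the two words are synonyms in these senses;
# otherwise wup_similarity gives a graded score (it may be None if the senses share no hierarchy)
print(transfer_sense)
print(transfer_sense == change_sense)
print(transfer_sense.wup_similarity(change_sense))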
See also:
All synonyms for word in python?
How to get synonyms from nltk WordNet Python

You need to first get the lemmas, then iterate over the lemmas and get their names, then check membership with the in operator:
>>> a in [j.name() for i in wn.synsets('change') for j in i.lemmas()]
True
>>> [j.name() for i in wn.synsets('change') for j in i.lemmas()]
[u'change', u'alteration', u'modification', u'change', u'change', u'change', u'change', u'change', u'change', u'change', u'change', u'variety', u'change', u'change', u'alter', u'modify', u'change', u'change', u'alter', u'vary', u'switch', u'shift', u'change', u'change', u'change', u'exchange', u'commute', u'convert', u'exchange', u'change', u'interchange', u'transfer', u'change', u'deepen', u'change', u'change']

wn.synsets gives you the list of meanings; each meaning has a list of words.
for sense in wn.synsets('change'):
    if "transfer" in sense.lemma_names():
        print "'transfer' is a synonym of 'change'"
        break

Those are different senses of the word. You can obtain the synonyms of each sense using synset('xxx').lemma_names(). Then you can check whether the word is present in them.
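A tiny helper along those lines (the function name is just for illustration):
from nltk.corpus import wordnet as wn

def is_synonym(candidate, word):
    # True if 'candidate' appears among the lemma names of any sense of 'word'
    return any(candidate in sense.lemma_names() for sense in wn.synsets(word))

print(is_synonym('transfer', 'change'))  # True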

Related

How to find and rank all prefixes in a list of strings?

I have a list of strings and I want to find popular prefixes. The prefixes are special in that they occur as strings in the input list.
I found a similar question here but the answers are geared to find the one most common prefix:
Find *most* common prefix of strings - a better way?
While my problem is similar, it differs in that I need to find all popular prefixes. Or to maybe state it a little simplistically, rank prefixes from most common to least.
As an example, consider the following list of strings:
in, india, indian, indian flag, bull, bully, bullshit
Prefixes rank:
in - 4 times
india - 3 times
bull - 3 times
...and so on. Please note - in, bull, india are all present in the input list.
The following are not valid prefixes:
ind
bu
bul
...since they do not occur in the input list.
What data structure should I be looking at to model my solution? I'm inclined to use a "trie" with a counter on each node that tracks how many times has that node been touched during the creation of the trie.
All suggestions are welcome.
Thanks.
p.s. - I love python and would love it if someone could post a quick snippet that could get me started.
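Since the question asks for a starter snippet, here is a rough sketch of the trie-with-counter idea; the class and function names are just illustrative, and only nodes that correspond to whole input strings are reported, per the constraint above.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0          # how many input strings pass through this node
        self.is_word = False    # does a complete input string end here?

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
        node.is_word = True
    return root

def popular_prefixes(trie, path=""):
    # collect (prefix, count) pairs for nodes that are themselves input strings
    results = [(path, trie.count)] if trie.is_word else []
    for ch, child in trie.children.items():
        results.extend(popular_prefixes(child, path + ch))
    return sorted(results, key=lambda pc: pc[1], reverse=True)

words = ["in", "india", "indian", "indian flag", "bull", "bully", "bullshit"]
print(popular_prefixes(build_trie(words)))
# e.g. [('in', 4), ('india', 3), ('bull', 3), ('indian', 2), ...] (order of ties may vary)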
words = [ "in", "india", "indian", "indian", "flag", "bull", "bully", "bullshit"]
Result = sorted([ (sum([ w.startswith(prefix) for w in words ]) , prefix ) for prefix in words])[::-1]
it goes through every word as a prefix and checks how many of the other words start with it and then sorts the result. the[::-1] simply reverses that order
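For readability, the same counting logic can be unrolled into a dictionary (equivalent to the one-liner above):
words = ["in", "india", "indian", "indian flag", "bull", "bully", "bullshit"]

# for every candidate prefix taken from the input list itself,
# count how many input strings start with it
counts = {prefix: sum(w.startswith(prefix) for w in words) for prefix in words}
for prefix, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(prefix, n)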
If we know the length of the prefix (say 3), we can count fixed-length prefixes with an NLTK FreqDist (here vocabulary stands for your list of words):
from nltk import FreqDist
prefixDist = FreqDist()
for word in vocabulary:
    prefixDist[word[:3]] += 1
commonPrefix = [prefix for (prefix, count) in prefixDist.most_common(150)]
print(commonPrefix)

How to un-stem a word in Python?

I want to know if there is any way that I can un-stem them to a normal form.
The problem is that I have thousands of words in different forms e.g. eat, eaten, ate, eating and so on and I need to count the frequency of each word. All of these - eat, eaten, ate, eating etc will count towards eat and hence, I used stemming.
But the next part of the problem requires me to find similar words in data, and I am using nltk's synsets to calculate Wu-Palmer Similarity among the words. The problem is that nltk's synsets won't work on stemmed words, or at least they won't in this code: check if two words are related to each other
How should I do it? Is there a way to un-stem a word?
I think an OK approach is something like the one described in https://stackoverflow.com/a/30670993/7127519.
A possible implementation could be something like this:
import re
import string
import nltk
import pandas as pd
stemmer = nltk.stem.porter.PorterStemmer()
A stemmer to use. Here is a text to work with:
complete_text = ''' cats catlike catty cat
stemmer stemming stemmed stem
fishing fished fisher fish
argue argued argues arguing argus argu
argument arguments argument '''
Create a list with the different words:
my_list = []
#for i in complete_text.decode().split():
try:
    aux = complete_text.decode().split()
except AttributeError:
    aux = complete_text.split()
for i in aux:
    if i.lower() not in my_list:
        my_list.append(i.lower())
my_list
with output:
['cats',
'catlike',
'catty',
'cat',
'stemmer',
'stemming',
'stemmed',
'stem',
'fishing',
'fished',
'fisher',
'fish',
'argue',
'argued',
'argues',
'arguing',
'argus',
'argu',
'argument',
'arguments']
And now create the dictionary:
aux = pd.DataFrame(my_list, columns =['word'] )
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))
aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))
aux.index = aux['word_stemmed']
del aux['word_stemmed']
my_dict = aux.to_dict('dict')['word']
my_dict
The output is:
{'argu': 'argue, argued, argues, arguing, argus, argu',
'argument': 'argument, arguments',
'cat': 'cats, cat',
'catlik': 'catlike',
'catti': 'catty',
'fish': 'fishing, fished, fish',
'fisher': 'fisher',
'stem': 'stemming, stemmed, stem',
'stemmer': 'stemmer'}
Companion notebook here.
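With my_dict built as above, un-stemming is then a plain dictionary lookup, for example:
print(my_dict['fish'])      # 'fishing, fished, fish'
print(my_dict['argument'])  # 'argument, arguments'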
No, there isn't. With stemming, you lose information, not only about the word form (as in eat vs. eats or eaten), but also about the word itself (as in tradition vs. traditional). Unless you're going to use a prediction method to try and predict this information on the basis of the context of the word, there's no way to get it back.
tl;dr: you could use any stemmer you want (e.g.: Snowball) and keep track of what word was the most popular before stemming for each stemmed word by counting occurrences.
You may like this open-source project which uses Stemming and contains an algorithm to do Inverse Stemming:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA
On this page of the project, there are explanations of how to do the Inverse Stemming. To sum things up, it works as follows.
First, you will stem some documents, here short (French language) strings with their stop words removed for example:
['sup chat march trottoir',
'sup chat aiment ronron',
'chat ronron',
'sup chien aboi',
'deux sup chien',
'combien chien train aboi']
Then the trick is to keep, for each stemmed word, a count of the original words that produced it:
{'aboi': {'aboie': 1, 'aboyer': 1},
'aiment': {'aiment': 1},
'chat': {'chat': 1, 'chats': 2},
'chien': {'chien': 1, 'chiens': 2},
'combien': {'Combien': 1},
'deux': {'Deux': 1},
'march': {'marche': 1},
'ronron': {'ronronner': 1, 'ronrons': 1},
'sup': {'super': 4},
'train': {'train': 1},
'trottoir': {'trottoir': 1}}
Finally, you may now guess how to implement this by yourself: simply take, for a given stemmed word, the original word with the highest count. You can refer to the following implementation, which is available under the MIT License as part of the Multilingual-Latent-Dirichlet-Allocation-LDA project:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/lda_service/logic/stemmer.py
Improvements could be made by discarding the non-top original words (using a heap, for example), which would yield a single dict in the end instead of a dict of dicts.
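As a rough sketch of that idea (not the project's actual code; the toy documents and the choice of the Snowball stemmer are my own):
from collections import Counter, defaultdict
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stem_to_originals = defaultdict(Counter)

documents = ["The dogs were barking", "A dog barked at the other dogs"]
for doc in documents:
    for token in doc.split():
        # remember which original word produced each stem, and how often
        stem_to_originals[stemmer.stem(token)][token] += 1

def unstem(stem):
    # the most frequent original word observed for this stem
    return stem_to_originals[stem].most_common(1)[0][0]

print(unstem(stemmer.stem("dogs")))  # 'dogs' (seen twice, versus 'dog' once)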
I suspect what you really mean by stem is "tense". As in, you want the different tenses of each word to count towards the "base form" of the verb.
Check out the pattern package:
pip install pattern
Then use the en.lemma function to return a verb's base form.
import pattern.en as en
base_form = en.lemma('ate') # base_form == "eat"
Theoretically, the only way to un-stem is to keep, prior to stemming, a dictionary of terms or a mapping of some kind, and to carry that mapping through the rest of your computations. The mapping should capture the position of each unstemmed token; then, when you need to un-stem a token, knowing the original position of the stemmed token lets you trace back and recover the original unstemmed representation.
For the Bag of Words representation this seems computationally intensive and somewhat defeats the purpose of the statistical nature of the BoW approach.
But again, theoretically I believe it could work. I haven't seen it in any implementation, though.

Discovering Poetic Form with NLTK and CMU Dict

Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
        #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print a, nsyl(a),
            sum1 = sum1 + len(str(nsyl(a)))
    print "\nTotal syllables:", sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word ('1') when there is not ('0'), which is why the output looks like this: 1101111101, when it should be the stress pattern of iambic pentameter: 0101010101.
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n")
and can remain in the main function, but the second step would be better placed in its own function, and the third step would require being split up itself and would probably be better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a post-grad for a couple of months to help you, instead of some workshop requirement.
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last year undergrad should not have too much trouble with it ( hint hint ).
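If you'd rather not implement those measures yourself, the standard library's difflib already gives a normalized 0.0-1.0 similarity ratio in the same spirit; a minimal sketch (the ideal pattern and any threshold are your choice):
import difflib

def meter_similarity(stress_pattern, ideal_pattern):
    # normalized 0.0-1.0 similarity between an observed stress string and an ideal one
    return difflib.SequenceMatcher(None, stress_pattern, ideal_pattern).ratio()

iambic_pentameter = "0101010101"
print(meter_similarity("1101111101", iambic_pentameter))  # about 0.6
print(meter_similarity("0101010101", iambic_pentameter))  # 1.0
You would then classify a line as a given form when its score against that form's pattern (of the matching length) exceeds a threshold you pick.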
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick the ones that look like known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
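A quick sketch of that fudging approach (the vowel-cluster heuristic is deliberately crude, as described above):
import re
from itertools import product

def naive_syllables(word):
    # crude fallback for out-of-dictionary words: one syllable per vowel cluster
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def stress_candidates(word):
    # every possible 0/1 stress assignment for the guessed syllable count
    n = naive_syllables(word)
    return [''.join(bits) for bits in product('01', repeat=n)]

print(naive_syllables('Corinna'))  # 3
print(stress_candidates('the'))    # ['0', '1']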
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on stackoverflow.
And I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
The code included in this question helped me, so I'm posting what I came up with, which builds on that foundation. It is one way to extract the stress as a single string, correct the cmudict bias toward '1' with a 'fudging factor', and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict

prondict = cmudict.dict()

#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print line
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print word
        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in array returned from prondict
                # if it exists use it
                # print strip_letters(s),word
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}

# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # remove punctuations from the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}

# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws",ws
        for ch in list(ws):
            #print "ch",ch
            if ch.isdigit():
                nm = nm + ch
                #print "ad to nm",nm, type(nm)
    return nm

# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)

line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)

line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}
"""

All synonyms for word in python? [duplicate]

This question already has answers here:
How to get synonyms from nltk WordNet Python
(8 answers)
Closed 7 years ago.
The code to get the synonyms of a word in Python is, say:
from nltk.corpus import wordnet
dog = wordnet.synset('dog.n.01')
print dog.lemma_names
>>['dog', 'domestic_dog', 'Canis_familiaris']
However, dog.n.02 gives different words, and for an arbitrary word I can't know in advance how many senses there are. How can I return all of the synonyms for a word?
Using wn.synset('dog.n.1').lemma_names is the correct way to access the synonyms of a sense. That's because a word has many senses, and it's more appropriate to list the synonyms of a particular meaning/sense. To enumerate words with similar meanings, you can also look at the hyponyms.
Sadly, the size of WordNet is very limited, so there are very few lemma_names available for each sense.
Using WordNet as a dictionary/thesaurus is not very apt per se, because it was developed as an inventory of senses/meanings rather than an inventory of words. However, you can access a particular sense and the several (not many) words related to that sense. One can use WordNet as a:
Dictionary: given a word, what are the different meanings of the word
for i, j in enumerate(wn.synsets('dog')):
    print "Meaning", i, "NLTK ID:", j.name
    print "Definition:", j.definition
Thesaurus: given a word, what are the different words for each meaning of the word
for i, j in enumerate(wn.synsets('dog')):
    print "Meaning", i, "NLTK ID:", j.name
    print "Definition:", j.definition
    print "Synonyms:", ", ".join(j.lemma_names)
    print
Ontology: given a word, what are the hyponyms (i.e. sub-types) and hypernyms (i.e. super-types).
from itertools import chain
for i, j in enumerate(wn.synsets('dog')):
    print "Meaning", i, "NLTK ID:", j.name
    print "Hypernyms:", ", ".join(list(chain(*[l.lemma_names for l in j.hypernyms()])))
    print "Hyponyms:", ", ".join(list(chain(*[l.lemma_names for l in j.hyponyms()])))
    print
[Ontology Output]
Meaning 0 NLTK ID: dog.n.01
Hypernyms: domestic_animal, domesticated_animal, canine, canid
Hyponyms: puppy, Great_Pyrenees, basenji, Newfoundland, Newfoundland_dog, lapdog, poodle, poodle_dog, Leonberg, toy_dog, toy, spitz, pooch, doggie, doggy, barker, bow-wow, cur, mongrel, mutt, Mexican_hairless, hunting_dog, working_dog, dalmatian, coach_dog, carriage_dog, pug, pug-dog, corgi, Welsh_corgi, griffon, Brussels_griffon, Belgian_griffon
Meaning 1 NLTK ID: frump.n.01
Hypernyms: unpleasant_woman, disagreeable_woman
Hyponyms:
Meaning 2 NLTK ID: dog.n.03
Hypernyms: chap, fellow, feller, fella, lad, gent, blighter, cuss, bloke
Hyponyms:
Meaning 3 NLTK ID: cad.n.01
Hypernyms: villain, scoundrel
Hyponyms: perisher
Note this other answer:
>>> wn.synsets('small')
[Synset('small.n.01'),
Synset('small.n.02'),
Synset('small.a.01'),
Synset('minor.s.10'),
Synset('little.s.03'),
Synset('small.s.04'),
Synset('humble.s.01'),
Synset('little.s.07'),
Synset('little.s.05'),
Synset('small.s.08'),
Synset('modest.s.02'),
Synset('belittled.s.01'),
Synset('small.r.01')]
Keep in mind that in your code you were trying to get the lemmas, but that's one level too deep for what you want. The synset level is about meaning, while the lemma level gives you words. In other words:
In WordNet (and I’m speaking of English WordNet here, though I think
the ones in other languages are similarly organized) a lemma has
senses. Specifically, a lemma (that is, a base word form that is
indexed in WordNet) has exactly as many senses as the number of
synsets that it participates in. Conversely, and as you say, synsets
contain one or more lemmas, which means that multiple lemmas (words)
can represent the same sense, or meaning.
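That relationship is easy to see in code (using the method-style API of current NLTK versions):
from nltk.corpus import wordnet

# one sense per synset the lemma 'dog' participates in
print(len(wordnet.synsets('dog')))
# multiple lemmas (words) sharing a single sense
print([lemma.name() for lemma in wordnet.synset('dog.n.01').lemmas()])
# ['dog', 'domestic_dog', 'Canis_familiaris']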
Also have a look at the NLTK's WordNet how to for a few more ways of exploring around a meaning or a word.
The documentation suggests
wordnet.synsets('dog')
to get all synsets for dog.
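To answer the original question directly, one way to collect every synonym across all senses is a set comprehension over all synsets (shown with the method-style calls of newer NLTK releases):
from nltk.corpus import wordnet

all_synonyms = sorted({lemma.name()
                       for synset in wordnet.synsets('dog')
                       for lemma in synset.lemmas()})
print(all_synonyms)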

0th synset in NLTK wordnet interface

From the semcor corpus (http://www.cse.unt.edu/~rada/downloads.html), there are senses that weren't mapped to the later versions of WordNet. And magically, the mapping can be found in the NLTK WordNet API as such:
>>> from nltk.corpus import wordnet as wn
# Enumerate the possible senses for the lemma 'delayed'
>>> wn.synsets('delayed')
[Synset('delay.v.01'), Synset('delay.v.02'), Synset('stay.v.06'), Synset('check.v.07'), Synset('delayed.s.01')]
>>> wn.synset('delay.v.01')
Synset('delay.v.01')
# Magically, there is a 0th sense of the word!!!
>>> wn.synset('delayed.a.0')
Synset('delayed.s.01')
I've checked the code and the API (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet.Synset-class.html, http://nltk.org/_modules/nltk/corpus/reader/wordnet.html), but I can't find how they do the magical mapping that shouldn't exist (e.g. delayed.a.0 -> delayed.s.01).
Does anyone know which part of the NLTK WordNet API code does the magical mapping?
It's a bug I guess. When you do wn.synset('delayed.a.0') the first two lines in the method are:
lemma, pos, synset_index_str = name.lower().rsplit('.', 2)
synset_index = int(synset_index_str) - 1
So in this case the value of synset_index is -1, which is a valid index in Python, and the lookup won't fail when indexing into the list of synsets whose lemma is delayed and whose pos is a.
With this behavior you can do tricky things like:
>>> wn.synset('delay.v.-1')
Synset('stay.v.06')
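To see why that works, here is the same index arithmetic reproduced outside NLTK (the variable names mirror the quoted source):
name = 'delayed.a.0'
lemma, pos, synset_index_str = name.lower().rsplit('.', 2)
synset_index = int(synset_index_str) - 1
# 0 - 1 == -1, which Python treats as "last element" instead of raising an IndexError
print(lemma, pos, synset_index)  # delayed a -1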
