I'm trying to create a general synonym identifier for the words in a sentence that are significant (i.e. not "a" or "the"), and I am using the Natural Language Toolkit (NLTK) in Python for it. The problem I am having is that the synonym finder in NLTK requires a part-of-speech argument in order to link a word to its synonyms. My attempted fix was to use the simplified part-of-speech tagger in NLTK and then take the first letter of each tag to pass as this argument into the synonym finder, but this is not working.
import nltk
from nltk import stem
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag.simplify import simplify_wsj_tag  # only available in older NLTK versions

def synonyms(Sentence):
    Keywords = []
    Equivalence = WordNetLemmatizer()
    Stemmer = stem.SnowballStemmer('english')
    for word in Sentence:
        word = Equivalence.lemmatize(word)
    words = nltk.word_tokenize(Sentence.lower())
    text = nltk.Text(words)
    tags = nltk.pos_tag(text)
    simplified_tags = [(word, simplify_wsj_tag(tag)) for word, tag in tags]
    for tag in simplified_tags:
        print tag
        grammar_letter = tag[1][0].lower()
        if grammar_letter != 'd':
            Call = tag[0].strip() + "." + grammar_letter.strip() + ".01"
            print Call
            Word_Set = wordnet.synset(Call)
            paths = Word_Set.lemma_names
            for path in paths:
                Keywords.append(Stemmer.stem(path))
    return Keywords
This is the code I am currently working from. As you can see, I am first lemmatizing the input to reduce the number of matches I will have in the long run (I plan on running this on tens of thousands of sentences), and in theory I would stem the word after this to further that effect and reduce the number of redundant words I generate. However, this method almost invariably returns errors in the form of the one below:
Traceback (most recent call last):
  File "C:\Python27\test.py", line 45, in <module>
    synonyms('spray reddish attack force')
  File "C:\Python27\test.py", line 39, in synonyms
    Word_Set = wordnet.synset(Call)
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1016, in synset
    raise WordNetError(message % (lemma, pos))
WordNetError: no lemma 'reddish' with part of speech 'n'
I don't have much control over the data this will be running over, and so simply cleaning my corpus is not really an option. Any ideas on how to solve this one?
I did some more research and I have a promising lead, but I'm still not sure how I could implement it. In the case of a word that is not found or is incorrectly tagged, I would like to use a similarity metric (Leacock-Chodorow, Wu-Palmer, etc.) to link the word to the closest correctly categorized keyword, perhaps in conjunction with an edit distance measure, but again I haven't been able to find any documentation on this.
Apparently NLTK allows for the retrieval of all synsets associated with a word. Granted, there are usually a number of them, reflecting different word senses. In order to functionally find synonyms (or to tell whether two words are synonyms) you must try to match the closest possible pair of synsets, which is possible through any of the similarity metrics mentioned above. I put together some basic code to do this, shown below, which tests whether two words are synonyms:
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
import itertools

def Synonym_Checker(word1, word2):
    """Checks if word1 and word2 are synonyms. Returns True if they are, otherwise False"""
    equivalence = WordNetLemmatizer()
    word1 = equivalence.lemmatize(word1)
    word2 = equivalence.lemmatize(word2)

    word1_synonyms = wordnet.synsets(word1)
    word2_synonyms = wordnet.synsets(word2)

    scores = [i.wup_similarity(j) for i, j in itertools.product(word1_synonyms, word2_synonyms)]
    max_index = scores.index(max(scores))
    # itertools.product varies the second iterable fastest, so recover the pair indices accordingly
    best_match = (max_index // len(word2_synonyms), max_index % len(word2_synonyms))

    word1_set = word1_synonyms[best_match[0]].lemma_names
    word2_set = word2_synonyms[best_match[1]].lemma_names
    match = any(word in word2_set for word in word1_set)
    return match

print Synonym_Checker("tomato", "Lycopersicon_esculentum")
I may try to implement progressively stronger stemming algorithms, but for the first few tests I did, this code actually worked for every word I could find. If anyone has ideas on how to improve this algorithm, or anything that would improve this answer in any way, I would love to hear it.
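For the stemming idea, NLTK already ships stemmers of varying strength, so comparing them is cheap. A rough sketch (the example word is arbitrary; Lancaster is generally the most aggressive of the three):

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# Compare stemmers of increasing aggressiveness on one example word
word = "summarization"
for stemmer in (PorterStemmer(), SnowballStemmer('english'), LancasterStemmer()):
    print("{0}: {1}".format(stemmer.__class__.__name__, stemmer.stem(word)))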
Can you wrap your Word_Set = wordnet.synset(Call) with a try: and ignore the WordNetError exception? Looks like the error you have is that some words are not categorized correctly, but this exception would also occur for unrecognized words, so catching the exception just seems like a good idea to me.
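If you do that, the WordNetError shown in the traceback can be imported and caught directly. A minimal sketch, with safe_synset being a made-up helper name:

from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import WordNetError

def safe_synset(call):
    """Return the synset for a name like 'reddish.n.01', or None if WordNet rejects it."""
    try:
        return wordnet.synset(call)
    except WordNetError:
        return None  # unknown word, or wrong part of speech for this word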
Related
I would like to know how you would find all the variations of a word, or the words that are related or very similar to the original word, in Python.
An example of the sort of thing I am looking for is like this:
word = "summary" # any word
word_variations = find_variations_of_word(word)  # a function that finds all the variations of a word; this is what I want to know how to make
print(word_variations)
# What it should print out: ["summaries", "summarize", "summarizing", "summarized"]
This is just an example of what the code should do. I have seen other similar questions on this same topic, but none of them were accurate enough. I found some code and adapted it on my own, which kind of works, but not the way I would like it to.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def find_inflections(word):
    inflections = []
    for synset in wordnet.synsets(word):  # Find all synsets for the word
        for lemma in synset.lemmas():  # Find all lemmas for each synset
            inflected_form = lemma.name().replace("_", " ")  # Get the inflected form of the lemma
            if inflected_form != word:  # Only add the inflected form if it's different from the original word
                inflections.append(inflected_form)
    return inflections

word = "summary"
inflections = find_inflections(word)
print(inflections)
# Output: ['sum-up', 'drumhead', 'compendious', 'compact', 'succinct']
# What the Output should be: ["summaries", "summarize", "summarizing", "summarized"]
This probably isn't of any use to you, but may help someone else who finds this with a search -
If the aim is just to find the words, rather than specifically to use a machine-learning approach to the problem, you could try using a regular expression (regex).
W3Schools seems to cover enough to get the result you want here, or there is a more technical overview on python.org.
To search case-insensitively for the specific words you listed, the following would work:
import re

string = "A SUMMARY ON SUMMATION: " \
         "We use summaries to summarize. This action is summarizing. " \
         "Once the action is complete things have been summarized."

occurrences = re.findall("summ[a-zA-Z]*", string, re.IGNORECASE)
print(occurrences)
However, depending on your precise needs you may need to modify the regular expression as this would also find words like 'summer' and 'summon'.
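For example, one possible tightening (the narrower stem "summar" and the extra sentence are just for illustration) keeps the forms asked about while dropping words like 'summer', 'summon', and 'summation':

import re

string = ("A SUMMARY ON SUMMATION: We use summaries to summarize. "
          "This action is summarizing. Once the action is complete "
          "things have been summarized. Summer is coming.")

# \b anchors at a word boundary; \w* then allows any suffix after "summar"
occurrences = re.findall(r"\bsummar\w*", string, re.IGNORECASE)
print(occurrences)  # ['SUMMARY', 'summaries', 'summarize', 'summarizing', 'summarized']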
I'm not very good with regexes, but they can be a powerful tool if you know precisely what you are looking for and spend a little time crafting the right expression.
Sorry this probably isn't relevant to your circumstance but good luck.
A function in my program finds the definitions of certain vocabulary words, which will be useful for other parts of my program. However, it seems that not every vocabulary word is present in wordnet.
I find the definitions as follows:
y = wn.synset(w + '.n.01').definition()
where 'w' is one of the many vocabulary words being fed from a list (I did not include the rest of the program because it has too much irrelevant code). When the list reaches the term 'ligase', however, the following error comes up:
line 1298, in synset
raise WordNetError(message % (lemma, pos))
nltk.corpus.reader.wordnet.WordNetError: no lemma 'ligase' with part of speech 'n'
Is there any way to bypass this, or a different way to find the definitions of terms that are not in WordNet? My program is going through various scientific terms, so this may occur more often as I add more words to the list.
You should not make an assumption that a word is known to WordNet. Check if there are any relevant synsets, and ask for a definition only if there is at least one:
for word in ["ligase", "world"]:  # Your word list
    ss = wn.synsets(word)
    if ss:
        definition = ss[0].definition()
        print("{}: {}".format(word, definition))
    else:
        print("### Word {} not found".format(word))

# ### Word ligase not found
# world: everything that exists anywhere
There are several words that use "-ing" as the present continuous, like "shining", but when I try to lemmatize "shining" using NLTK, it changes into "shin". The code is this:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
word = "shining"
newlemma = wordnet_lemmatizer.lemmatize(word,'v')
print newlemma
Even without using 'v', it stays the same, "shining", and doesn't change.
I'm expecting the output "shine".
Can anybody help? Thanks.
This happens because of the way WordNet applies rules and exception lists when searching for the root form.
It has a list of rules, particularly for removing word endings, for instance:
"ing" -> ""
"ing" -> "e"
It applies the rules and checks whether the resulting word form exists in WordNet. So, for instance, with "mining" it would try "min" and not find anything, then try "mine" (second rule), find that "mine" is a valid word, and return it. But with "shining", it likely tries "shin", finds "shin" in the list of valid words, believes this to be the proper root, and returns it.
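You can see this behaviour with the standard NLTK WordNet calls; the lemmatizer results below are the ones discussed above, while the exact synsets printed depend on your WordNet data:

from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Stripping "ing" from "shining" yields "shin", which really is a verb in WordNet
# (to shin up a tree), so it competes with "shine" and ends up being returned:
print(wn.synsets('shin', pos='v'))           # non-empty: "shin" exists as a verb
print(lemmatizer.lemmatize('shining', 'v'))  # -> 'shin'

# For "mining", the stripped form "min" is not a verb, so "mine" is returned instead:
print(wn.synsets('min', pos='v'))            # []
print(lemmatizer.lemmatize('mining', 'v'))   # -> 'mine'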
The speech recognition software that I'm using gives less than optimal results.
E.g. "session" is returned as "fashion" or "mission".
Right now I have a dictionary like:
matches = {
'session': ['fashion', 'mission'],
...
}
and I am looping over all the words to find a match.
I do not mind false positives, as the application accepts only a limited set of keywords. However, it is tedious to manually enter new words for each of them. Also, the speech recognizer comes up with new words every time I speak.
I am also running into difficulties where a long word is returned as a group of smaller words, so the above approach won't work.
So, is there an in-built method in nltk to do this? Or even a better algorithm that I could write myself?
You may want to look into python-Levenshtein. It's a python C extension module for calculating string distances/similarities.
Something like this silly inefficient code might work:
from Levenshtein import jaro_winkler  # May not be the exact module/function name

heard_word = "brain"
possible_words = ["watermelon", "brian"]

word_scores = [jaro_winkler(heard_word, possible) for possible in possible_words]
guessed_word = possible_words[word_scores.index(max(word_scores))]

print('I heard {0} and guessed {1}'.format(heard_word, guessed_word))
Here's the documentation and a non-maintained repo.
You can use fuzzywuzzy, a Python package for fuzzy matching of words and strings.
To install the package:
pip install fuzzywuzzy
Sample code related to your question.
from fuzzywuzzy import fuzz
MIN_MATCH_SCORE = 80
heard_word = "brain"
possible_words = ["watermelon", "brian"]
guessed_word = [word for word in possible_words if fuzz.ratio(heard_word, word) >= MIN_MATCH_SCORE]
print 'I heard {0} and guessed {1}'.format(heard_word, guessed_word)
Here is the documentation and repo of fuzzywuzzy.
I am working on a polysemy disambiguation project, and for that I am trying to find the polysemous words in an input query. The way I am doing it is:
#! /usr/bin/python
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
stop = stopwords.words('english')
print "enter input query"
string = raw_input()
str1 = [i for i in string.split() if i not in stop]
a = list()
for w in str1:
    if len(wn.synsets(w)) > 1:
        a.append(w)
Here, the list a will contain the polysemous words.
But using this method, almost all words are considered polysemous.
E.g. if my input query is "milk is white in colour", then it stores ('milk', 'white', 'colour') as polysemous words.
WordNet is known to be very fine-grained, and it sometimes makes distinctions between very subtly different senses that you and I might think are the same. There have been attempts to make WordNet coarser; search for work on automatically constructing a coarse-grained WordNet. I am not sure if the results of that work are available for download, but you can always contact the authors.
Alternatively, change your working definition of polysemy: if the most frequent sense of a word accounts for more than 80% of its uses in a large corpus, then the word is not polysemous. You will have to obtain frequency counts for the different senses of as many words as possible. Start your research here and here.
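If you want to experiment with that definition without building your own sense-tagged corpus, NLTK's WordNet lemmas carry SemCor-based usage counts (lemma.count()). Those counts are sparse and often zero, so treat the sketch below, including the 0.8 threshold and the fallbacks, as an assumption-heavy illustration rather than a ready solution:

from nltk.corpus import wordnet as wn

def is_polysemous(word, dominance_threshold=0.8):
    # Collect a (SemCor-derived) usage count for each sense of the word
    sense_counts = []
    for synset in wn.synsets(word):
        count = sum(lemma.count() for lemma in synset.lemmas()
                    if lemma.name().lower() == word.lower())
        sense_counts.append(count)

    if len(sense_counts) <= 1:
        return False  # zero or one sense: not polysemous under any definition
    total = sum(sense_counts)
    if total == 0:
        return True   # no counts available; fall back to the naive "more than one synset" test

    # Polysemous only if the dominant sense covers less than the threshold share of uses
    return max(sense_counts) / float(total) < dominance_threshold

print(is_polysemous("milk"))
print(is_polysemous("bank"))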