How to find polysemous words in an input query? - python

I am working on a polysemy disambiguation project, and for that I am trying to find the polysemous words in an input query. The way I am doing it is:
#! /usr/bin/python
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
stop = stopwords.words('english')
print "enter input query"
string = raw_input()
str1 = [i for i in string.split() if i not in stop]
a = list()
for w in str1:
    if len(wn.synsets(w)) > 1:
        a.append(w)
Here the list a will contain the polysemous words.
But with this method almost every word is considered polysemous.
For example, if my input query is "milk is white in colour", it stores ('milk', 'white', 'colour') as polysemous words.

WordNet is known to be very fine-grained, and it sometimes makes distinctions between very subtly different senses that you and I might think are the same. There have been attempts to make WordNet coarser; search for work on the automatic construction of a coarse-grained WordNet. I am not sure if the results of that work are available for download, but you can always contact the authors.
Alternatively, change your working definition of polysemy. If the most frequent sense of a word accounts for more than 80% of its uses in a large corpus, then the word is not polysemous. You will have to obtain frequency counts for the different senses of as many words as possible. Start your research here and here.
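As a rough sketch of that frequency-based definition, WordNet's lemma.count() exposes sense frequencies derived from the SemCor corpus, so you can check how dominant the most frequent sense is. The 0.8 threshold and the fallback for words without counts are assumptions, not part of the rule above:
from nltk.corpus import wordnet as wn

def is_polysemous(word, threshold=0.8):
    # SemCor-derived frequency counts for each sense of this word form.
    counts = [lemma.count()
              for synset in wn.synsets(word)
              for lemma in synset.lemmas()
              if lemma.name().lower() == word.lower()]
    if len(counts) < 2:
        return False   # zero or one sense: not polysemous
    total = sum(counts)
    if total == 0:
        return True    # no frequency data; fall back to the raw sense count
    # Polysemous only if the most frequent sense does not dominate.
    return max(counts) / float(total) <= threshold

print([w for w in ['milk', 'white', 'colour', 'bank'] if is_polysemous(w)])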

Related

Find all the variations (or tenses) of a word in Python

I would like to know how you would find all the variations of a word, or the words that are related or very similar to the original word, in Python.
An example of the sort of thing I am looking for is like this:
word = "summary" # any word
word_variations = find_variations_of_word(word) # a function that finds all the variations of a word; this is what I want to know how to write
print(word_variations)
# What it should print out: ["summaries", "summarize", "summarizing", "summarized"]
This is just an example of what the code should do. I have seen other similar questions on this topic, but none of them were accurate enough. I found some code and altered it to my own, which kind of works, but not the way I would like it to.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def find_inflections(word):
    inflections = []
    for synset in wordnet.synsets(word): # Find all synsets for the word
        for lemma in synset.lemmas(): # Find all lemmas for each synset
            inflected_form = lemma.name().replace("_", " ") # Get the inflected form of the lemma
            if inflected_form != word: # Only add the inflected form if it's different from the original word
                inflections.append(inflected_form)
    return inflections
word = "summary"
inflections = find_inflections(word)
print(inflections)
# Output: ['sum-up', 'drumhead', 'compendious', 'compact', 'succinct']
# What the Output should be: ["summaries", "summarize", "summarizing", "summarized"]
This probably isn't of any use to you, but may help someone else who finds this with a search -
If the aim is just to find the words, rather than specifically to use a machine-learning approach to the problem, you could try using a regular expression (regex).
W3Schools seems to cover enough to get the result you want, or there is a more technical overview on python.org.
To search case-insensitively for the specific words you listed, the following would work:
import re
string = "A SUMMARY ON SUMMATION:" \
         "We use summaries to summarize. This action is summarizing. " \
         "Once the action is complete things have been summarized."
occurrences = re.findall("summ[a-zA-Z]*", string, re.IGNORECASE)
print(occurrences)
However, depending on your precise needs you may need to modify the regular expression as this would also find words like 'summer' and 'summon'.
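As a follow-up to that caveat, a slightly tighter pattern (assuming you only want the 'summar-' family) avoids 'summer' and 'summon'; this is just a sketch, not the only way to do it:
import re

string = ("A SUMMARY ON SUMMATION: "
          "We use summaries to summarize. This action is summarizing. "
          "Once the action is complete things have been summarized. "
          "In summer we summon the sun.")

# 'summar' followed by word characters excludes 'summer', 'summon' and 'summation'.
occurrences = re.findall(r"\bsummar\w*", string, re.IGNORECASE)
print(occurrences)  # ['SUMMARY', 'summaries', 'summarize', 'summarizing', 'summarized']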
I'm not very good with regexes, but they can be a powerful tool if you know precisely what you are looking for and spend a little time crafting the right expression.
Sorry this probably isn't relevant to your circumstance but good luck.

Python - Is there an NLTK corpus for English (GB) words?

I am in the process of learning Python and trying to create an anagram creator/solver in flask.
I'm using nltk and have a basic script set up which descrambles a group of letters and finds the word from the corpus. I know my method may not be perfect - remember I'm still learning what is available to do in Python - but it works in principle and I've created a similar script to find all the words within a group of letters.
My problem is that it only uses American English, so in the example below 'favro' becomes 'favor' which is the American spelling, but 'favrou' doesn't become 'favour' which is the British spelling.
import itertools
import nltk
from nltk.corpus import words
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
scramble = "favro"
sep = ""
for y in list(itertools.permutations(scramble, len(scramble))):
    if sep.join(y) in english_vocab:
        print(sep.join(y))
Is there anything out there which distinguishes between American and British English?
I've tried to use 'enchant' and it works fine on the solver part, but when I try to create a list of words within a word it is incredibly slow. For example, when I try to find all the words within 'colours' nltk takes 0.08 seconds and enchant takes 2.5 seconds. This time difference increases as the number of letters increases, so enchant is not viable.
Any ideas?
Steve
If you're only using NLTK for a word corpus, you might just want to find a word list of British-English words and read that in instead.
Since we're talking anagram solvers, why not use the SOWPODS word list, as used in official Scrabble competitions?
For instance, https://raw.githubusercontent.com/jesstess/Scrabble/master/scrabble/sowpods.txt (warning: big file!) contains FAVOUR, so it should work for you.
EDIT: To elaborate on my comment, e.g.
from collections import defaultdict
ws = defaultdict(set)
for word in open("./sowpods.txt"):
    word = word.lower().strip()
    if word:
        ws[frozenset(word)].add(word)

def find(word):
    return ws[frozenset(word)]
print(find("wired"))
outputs
{'rewired', 'weirdie', 'wiredrew', 'dewier', 'weired', 'widder', 'wried', 'weird', 'weirded', 'weedier', 'wider', 'wired', 'weirder'}
in 0.7 seconds.
(Yes, that's a superset of the possible words, but it's easy to filter down. Another option that would avoid this is to use ''.join(sorted(word)) as the key for the dict.)
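For what it's worth, here is a sketch of that second option, using the sorted letters as the key so a lookup only returns exact anagrams (same file and names as above):
from collections import defaultdict

ws = defaultdict(set)
for word in open("./sowpods.txt"):
    word = word.lower().strip()
    if word:
        # Sorting the letters gives one canonical key per multiset of letters,
        # so a lookup only ever returns true anagrams.
        ws["".join(sorted(word))].add(word)

def find(word):
    return ws["".join(sorted(word))]

print(find("wired"))  # only exact anagrams, e.g. {'weird', 'wired', 'wried'}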

Cross-Lingual Word Sense Disambiguation

I am a beginner in computer programming and I am completing an essay on Parallel Corpora in Word Sense Disambiguation.
Basically, I intend to show that substituting a sense for a word translation simplifies the process of identifying the meaning of ambiguous words. I have already word-aligned my parallel corpus (EUROPARL English-Spanish) with GIZA++, but I don't know what to do with the output files. My intention is to build a classifier to calculate the probability of a translation word given the contextual features of the tokens which surround the ambiguous word in the source text.
So, my question is: how do you extract instances of an ambiguous word from a parallel corpus WITH its aligned translation?
I have tried various scripts in Python, but these are run on the assumption that 1) the English and Spanish texts are in separate corpora and 2) the English and Spanish sentences share the same indexes, which obviously does not work.
e.g.
def ambigu_word2(document, document2):
    words = ['letter']
    for sentences in document:
        tokens = word_tokenize(sentences)
        for item in tokens:
            x = w_lemma.lemmatize(item)
            for w in words:
                if w == x in sentences:
                    print(sentences, document2[document.index(sentences)])

print(ambigu_word2(raw1, raw2))
I would be really grateful if you could provide any guidance on this matter.

words.words() from the nltk corpus seemingly contains strange non-valid words

This code loops through every word in words.words() from the nltk library, then pushes the word into an array. Then it checks every word in the array to see if it is an actual word using the same library, and somehow many of them are strange words that aren't real at all, like "adighe". What's going on here?
import nltk
from nltk.corpus import words
test_array = []
for i in words.words():
    i = i.lower()
    test_array.append(i)

for i in test_array:
    if i not in words.words():
        print(i)
I don't think there's anything mysterious going on here. The first such example I found is "Aani", "the dog-headed ape sacred to the Egyptian god Thoth". Since it's a proper noun, "Aani" is in the word list and "aani" isn't.
According to dictionary.com, "Adighe" is an alternative spelling of "Adygei", which is another proper noun meaning a region of Russia. Since it's also a language I suppose you might argue that "adighe" should also be allowed. This particular word list will argue that it shouldn't.
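As a small sketch confirming the explanation above: lowercase both sides of the comparison, and keep the vocabulary in a set so each membership test is O(1) instead of a scan of the whole word list.
from nltk.corpus import words

# Lowercase the vocabulary once and store it as a set for fast lookups.
vocab = set(w.lower() for w in words.words())

test_array = [w.lower() for w in words.words()]
missing = [w for w in test_array if w not in vocab]
print(missing)  # [] -- nothing is "missing" once both sides are lowercased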

Performing stemming outputs gibberish/truncated words

I am experimenting with the python library NLTK for Natural Language Processing.
My problem: I'm trying to perform stemming, i.e. reducing words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I'm after?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem

stemmer = stem.PorterStemmer()
words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
for word in words:
    print stemmer.stem(word)

#output is:
#forgot forgotten there' myself remuner
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
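A minimal sketch of that WordNet route; note that a lemmatizer returns dictionary base forms, so the output still won't match the asker's exact target list (for instance 'forgotten' lemmatizes to 'forget', not 'forgot'):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["forgot", "forgotten", "there's", "myself", "remuneration"]

# The lemmatizer assumes nouns by default; pass pos='v' to treat words as verbs.
for word in words:
    print(lemmatizer.lemmatize(word, pos='v'))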
