Cross-Lingual Word Sense Disambiguation - python

I am a beginner in computer programming and I am completing an essay on Parallel Corpora in Word Sense Disambiguation.
Basically, I intend to show that substituting a sense for a word translation simplifies the process of identifying the meaning of ambiguous words. I have already word-aligned my parallel corpus (EUROPARL English-Spanish) with GIZA++, but I don't know what to do with the output files. My intention is to build a classifier to calculate the probability of a translation word given the contextual features of the tokens which surround the ambiguous word in the source text.
So, my question is: how do you extract instances of an ambiguous word from a parallel corpus WITH its aligned translation?
I have tried various scripts in Python, but they run on the assumption that 1) the English and Spanish texts are in separate corpora and 2) the English and Spanish sentences share the same indexes, which obviously does not work.
e.g.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

w_lemma = WordNetLemmatizer()

def ambigu_word2(document, document2):
    # document / document2: lists of English / Spanish sentences
    words = ['letter']
    for sentence in document:
        tokens = word_tokenize(sentence)
        for item in tokens:
            x = w_lemma.lemmatize(item)
            for w in words:
                if w == x:
                    # assumes both lists share the same sentence indexes
                    print(sentence, document2[document.index(sentence)])

ambigu_word2(raw1, raw2)
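For concreteness, here is a minimal sketch of the kind of extraction being asked about. It assumes the GIZA++ alignments have been converted to the common Pharaoh "i-j" format (one line of alignment links per sentence pair, e.g. "0-0 1-2 2-1") and that the English and Spanish sides are stored one tokenized sentence per line; the file paths, the helper name and the example word are illustrative only.

def extract_instances(en_path, es_path, align_path, target="letter"):
    instances = []
    with open(en_path) as en, open(es_path) as es, open(align_path) as al:
        for en_line, es_line, al_line in zip(en, es, al):
            en_toks = en_line.split()
            es_toks = es_line.split()
            # alignment links as (english_index, spanish_index) pairs
            links = [tuple(map(int, pair.split("-"))) for pair in al_line.split()]
            for i, tok in enumerate(en_toks):
                if tok.lower() == target:
                    # every Spanish token aligned to the ambiguous English token
                    translations = [es_toks[j] for (src, j) in links if src == i and j < len(es_toks)]
                    instances.append((en_toks, i, translations))
    return instances

Each returned instance keeps the whole English sentence and the position of the ambiguous word, so the surrounding tokens can later be turned into contextual features for the classifier, with the aligned Spanish word(s) acting as the sense label.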
I would be really grateful if you could provide any guidance on this matter.

Related

Need help in constructing pattern for OR and AND using spacy python

Let's say I have a text saying:
input sentence : Computer programming is the process of writing instructions that get executed by computers. The instructions, also known as code, are written in a programming language which the computer can understand and use to perform a task or solve a problem. Basic computer programming involves the analysis of a problem and development of a logical sequence of instructions to solve it.
and I have to check whether any of the phrases match the given text.
or_phrases = ["process", "design"]
Expected output: yes, because the input text contains the word "process".
and_phrases = ["love", "live"]
Expected output: None, because the input text does not contain both "love" and "live" (in fact it contains neither; the order of the words doesn't matter). To convert the AND case into a regex:
re.match(r'(?=.*love)(?=.*live)', text)
I'm looking to put this rule into spaCy's phrase or token matcher.
Is there a way to express this kind of pattern with a spaCy matcher?
or_phrases should give me the sentences containing either of the words.
and_phrases should give me the sentences containing both of the words.
With spaCy you can easily make an OR match by using 'TEXT': {'IN': ["word1", "word2"]}. For your example it looks like this:
text = """Computer programming is the process of writing instructions that get executed by computers. The instructions, also known as code, are written in a programming language which the computer can understand and use to perform a task or solve a problem. Basic computer programming involves the analysis of a problem and development of a logical sequence of instructions to solve it. """
doc = nlp(text)
matcher = Matcher(nlp.vocab)
or_pattern = [[
{"TEXT": {"IN": ["process", "random"]}} # the list of "OR" words
]]
matcher.add("or_phrases", or_pattern)
for sent in doc.sents:
matches = matcher(sent)
if len(matches) > 0:
print(sent)
I still can't come up with an easy solution for AND, but I'll try.
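In the meantime, one rough workaround (a sketch, not a polished solution, reusing nlp, doc and Matcher from the snippet above) is to register one single-token pattern per required word under its own key and keep only the sentences in which every key matched; and_words below is just an example list:

and_words = ["love", "live"]          # words that must all appear in a sentence
and_matcher = Matcher(nlp.vocab)
for w in and_words:
    # one pattern per word, each registered under its own key
    and_matcher.add("AND_" + w, [[{"LOWER": w}]])

for sent in doc.sents:
    # the set of keys that matched inside this sentence
    matched = {nlp.vocab.strings[match_id] for match_id, start, end in and_matcher(sent)}
    if len(matched) == len(and_words):
        print(sent)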

NLTK wordnet does not contain vocabulary term - Python

A function in my program finds the definitions of certain vocabulary words, which will be useful for other parts of my program. However, it seems that not every vocabulary word is present in WordNet.
I find the definitions as follows:
y = wn.synset(w + '.n.01').definition()
where 'w' is one of the many vocabulary words being fed from a list (I did not include the rest of the program because it contains too much irrelevant code). When the list reaches the term 'ligase', however, the following error comes up:
line 1298, in synset
raise WordNetError(message % (lemma, pos))
nltk.corpus.reader.wordnet.WordNetError: no lemma 'ligase' with part of speech 'n'
Is there any way to bypass this, or a different way to find the definitions of terms that are not in WordNet? My program goes through various scientific terms, so this may occur more often as I add more words to the list.
You should not assume that a word is known to WordNet. Check whether there are any relevant synsets, and ask for a definition only if there is at least one:
from nltk.corpus import wordnet as wn

for word in ["ligase", "world"]:  # your word list
    ss = wn.synsets(word)
    if ss:
        definition = ss[0].definition()
        print("{}: {}".format(word, definition))
    else:
        print("### Word {} not found".format(word))

# Output:
# ### Word ligase not found
# world: everything that exists anywhere

Statistical sentence suggestion model like spell checking

There are already spell-checking models available which help us find suggested correct spellings based on a corpus of correct spellings. Can the granularity be increased from characters to words, so that we can have phrase suggestions: if an incorrect phrase is entered, it should suggest the nearest correct phrase from a corpus of correct phrases, trained, of course, on a list of valid phrases.
Are there any Python libraries which already achieve this functionality, or how should I proceed with an existing large gold-standard phrase corpus to get statistically relevant suggestions?
Note: this is different from a spell checker, as the alphabet of a spell checker is finite (characters), whereas in a phrase corrector the alphabet consists of words and is therefore theoretically unbounded, though we can limit it to the words in a phrase bank.
What you want to build is an N-gram model, which consists in computing the probability of each word following a sequence of n words.
You can use NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
You can consider 2-gram (Markov model):
What is the probability for "kitten" to follow "cute"?
...or 3-gram:
What is the probability for "kitten" to follow "the cute"?
etc.
Obviously, training the model with (n+1)-grams is costlier than with n-grams.
Instead of considering words, you can consider the pair (word, pos), where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens)).
You can also try to consider the lemmas instead of the words.
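For instance, a small sketch of building those alternative units with NLTK (the sample sentence is arbitrary):

import nltk
from nltk.stem import WordNetLemmatizer

tokens = nltk.word_tokenize("The cats are cute.")
tagged = nltk.pos_tag(tokens)                           # [('The', 'DT'), ('cats', 'NNS'), ...]
units = [(w.lower(), pos) for w, pos in tagged]         # (word, pos) pairs as n-gram units

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w.lower()) for w in tokens]  # lemmas instead of surface words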
Here are some interesting lectures about N-gram modelling:
Introduction to N-grams
Estimating N-gram Probabilities
Here is a simple, short and unoptimized code example for a 2-gram model:
from collections import defaultdict
import math

import nltk

# ngram[token][next_token] -> count, later turned into a log10 probability
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."

# count bigram occurrences
for sentence in nltk.sent_tokenize(corpus):
    tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1

# normalize the counts into log10 probabilities
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}
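A quick usage sketch on top of that dictionary (the looked-up bigram is just an example; pairs that were never observed simply have no entry):

# log10 probability that "cat" follows "the", if that bigram was seen
log_p = ngram["the"].get("cat")
if log_p is not None:
    print("P(cat | the) = {:.3f}".format(10 ** log_p))
else:
    print("bigram not observed")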

How to find polysemous words in an input query?

I am working on a polysemy disambiguation project, and for that I am trying to find the polysemous words in an input query. The way I am doing it is:
#! /usr/bin/python
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

stop = stopwords.words('english')

print "enter input query"
string = raw_input()
str1 = [i for i in string.split() if i not in stop]

a = list()
for w in str1:
    if len(wn.synsets(w)) > 1:
        a.append(w)
Here the list a will contain the polysemous words.
But with this method almost every word gets treated as polysemous.
E.g. if my input query is "milk is white in colour", then it stores ('milk', 'white', 'colour') as polysemous words.
WordNet is known to be very fine-grained, and it sometimes makes distinctions between very subtly different senses that you and I might think are the same. There have been attempts to make WordNet coarser; google "automatic generation of a coarse grained WordNet". I am not sure if the results of that paper are available for download, but you can always contact the authors.
Alternatively, change your working definition of polysemy. If the most frequent sense of a word accounts for more than 80% of its uses in a large corpus, then the word is not polysemous. You will have to obtain frequency counts for the different senses of as many words as possible. Start your research here and here.
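A rough sketch of that frequency-based filter, using the SemCor-derived sense counts that ship with NLTK's WordNet (the 0.8 threshold is just the figure mentioned above, and words without frequency data are kept as polysemous by default):

from nltk.corpus import wordnet as wn

def is_polysemous(word, threshold=0.8):
    # SemCor-based usage counts for each sense (lemma) of this word
    counts = [lemma.count()
              for synset in wn.synsets(word)
              for lemma in synset.lemmas()
              if lemma.name().lower() == word.lower()]
    if len(counts) <= 1:
        return False        # zero or one sense: not polysemous
    total = sum(counts)
    if total == 0:
        return True         # no frequency data: keep the word as polysemous
    return max(counts) / total < threshold

for w in ["milk", "white", "colour"]:
    print(w, is_polysemous(w))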

Performing Stemming outputs gibberish/concatenated words

I am experimenting with the Python library NLTK for Natural Language Processing.
My problem: I'm trying to perform stemming, i.e. reduce words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I'm after?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem

words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
stemmer = stem.PorterStemmer()  # a stemmer instance has to be created first
for word in words:
    print(stemmer.stem(word))

# output is:
# forgot forgotten there' myself remuner
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word; generally, different forms of the same word should map to the same stemmed token.
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
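For what it's worth, a quick sketch of NLTK's WordNet lemmatizer on the original word list; note that it returns dictionary lemmas (e.g. "forget" rather than "forgot") and that the part of speech matters (it assumes nouns unless told otherwise):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["forgot", "forgotten", "there's", "myself", "remuneration"]

for word in words:
    # try the word as a verb first, then with the default noun POS
    print(word, "->", lemmatizer.lemmatize(word, pos="v"), "/", lemmatizer.lemmatize(word))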
