Statistical sentence suggestion model like spell checking - python

There are already spell checking models available which help us to find the suggested correct spellings based on a corpus of trained correct spellings. Can the granularity be increased to "word" from alphabet so that we can have even phrase suggestions , such that if an incorrect phrase is entered then it should suggest the nearest correct phrase from the corpus of correct phrases, of course it is trained from a list of valid phrases.
Are there any python libraries which achieve this functionality already or how to proceed for this for an existing large gold standard phrase corpus to get statistically relevant suggestions?
Note: this is different from a spell checker as the alphabets in a spell checker are finite whereas in a phrase correcter the alphabet is itself a word hence theoretically infinite , but we can limit the number of words from a phrase bank.

What you want to build is a N-gram model which consist in computing the probability for each word to follow a sequence of n words.
You can use NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
You can consider 2-gram (Markov model):
What is the probability for "kitten" to follow "cute"?
...or 3-gram:
What is the probability for "kitten" to follow "the cute"?
etc.
Obviously training the model with n+1-gram is costlier than n-gram.
Instead of considering words, you can consider the couple (word, pos) where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens))
You can also try to consider the lemmas instead of the words.
Here some interesting lectures about N-gram modelling:
Introduction to N-grams
Estimating N-gram Probabilities
This is a simple and short example of code (2-gram) not optimized:
from collections import defaultdict
import nltk
import math
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
tokens = map(str.lower, nltk.word_tokenize(sentence))
for token, next_token in zip(tokens, tokens[1:]):
ngram[token][next_token] += 1
for token in ngram:
total = math.log10(sum(ngram[token].values()))
ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}

Related

Using BERT to extract most similar words instead of word2vec for labeling functions

I am fairly new to BERT, and I wanted to test both approaches of using word2vec and BERT to extract most_similar words to a given word to pattern match in my labeling functions
I am currently using snorkel, one labeling function looks as so:
#labeling_function()
def lf_find_good_synonyms(x):
good_synonyms = word_vectors.most_similar("good", topn=25)
good_list = syn_list(good_synonyms)
return POSITIVE if any(word in x.stemmed for word in good_list) else ABSTAIN
This function basically looks for the word "good" or any of it's similar words in a sentence (the sentences are stemmed so are the words as the function syn_list returns the stem of each similar word), if found, the function will simply label the sentence as POSITIVE.
The issue here is that my word vectors are based on word2vec, and it's an old approach, I was wondering if I could use BERT instead and will it improve the performance much, since labeling functions are allowed to be lousy?

words not available in corpus for Word2Vec training

I am totally new to Word2Vec. I want to find cosine similarity between word pairs in my data. My codes are as follows:
import pandas as pd
from gensim.models import Word2Vec
model = Word2Vec(corpus_file="corpus.txt", sg=0, window =7, size=100, min_count=10, iter=4)
vocabulary = list(model.wv.vocab)
data=pd.read_csv("experiment.csv")
cos_similarity = model.wv.similarity(data['word 1'], data['word 2'])
The problem is some words in the data columns of my "experiment.csv" file: "word 1" and "word 2" are not present in the corpus file ("corpus.txt"). So this error is returned:
"word 'setosa' not in vocabulary"
What should I do to handle words that are not present in my input corpus? I want to assign words in my experiment that are not present in the input corpus the vector zero, but I am stuck how to do it.
Any ideas for my problems?
It's really easy to give unknown words the origin (all 'zero') vector:
word = data['word 1']
if word in model.wv:
vec = model[word]
else:
vec = np.zeros(100)
But, this is unlikely what you want. The zero vector can't be cosine-similarity compared to other vectors.
It's often better to simply ignore unknown words. If they were so rare that your training data didn't haven enough of them to create a vector, they can't contribute much to other analyses.
If they're still important, the best approach is to get more data, with realistic usage contexts, so they get meaningful vectors.
Another alternative is to use an algorithm, such as the word2vec variant FastText, which can always synthesize a guess-vector for any words that were out-of-vocabulary (OOV) based on the training data. It does this by learning word-vectors for word-fragments (charactewr n-grams), then assembling a vector for a new unknown word from those fragments. It's often better than random, because unknown words are often typos or variants of known words with which they share a lot of segments. But it's still not great, and for really odd strings, essentially returns a random vector.
Another tactic I've seen used, but wouldn't personally recommend, is to replace a lot of the words that would otherwise be ignored – such as those with fewer than min_count occurrences – with some single plug token, like say '<OOV>'. Then that synthetic token becomes a quite-common word, but gets an almost entirely meaningless: a random low-magnitude vector. (The prevalence of this fake word & noise-vector in training will tend to make other surrounding words' vectors worse or slower-to-train, compared to simply eliding the low-frequency words.) But then, when dealing with later unknown words, you can use this same '<OOV>' pseudoword's vector as a not-too-harmful stand-in.
But again: it's almost always better to do some combination of – (a) more data; (b) ignoring rare words; (c) using a algorithm like FastText which can synthesize better-than-nothing vectors – than collapse all unknown words to a single nonsense vector.

Cross-Lingual Word Sense Disambiguation

I am a beginner in computer programming and I am completing an essay on Parallel Corpora in Word Sense Disambiguation.
Basically, I intend to show that substituting a sense for a word translation simplifies the process of identifying the meaning of ambiguous words. I have already word-aligned my parallel corpus (EUROPARL English-Spanish) with GIZA++, but I don't know what to do with the output files. My intention is to build a classifier to calculate the probability of a translation word given the contextual features of the tokens which surround the ambiguous word in the source text.
So, my question is: how do you extract instances of an ambiguous word from a parallel corpus WITH its aligned translation?
I have tried various scripts on Python, but these are run on the assumption that 1) the English and Spanish texts are in separate corpora and 2) the English and Spanish sentences share the same indexes, which obviously does not work.
e.g.
def ambigu_word2(document, document2):
words = ['letter']
for sentences in document:
tokens = word_tokenize(sentences)
for item in tokens:
x = w_lemma.lemmatize(item)
for w in words:
if w == x in sentences:
print (sentences, document2[document.index(sentences)])
print (ambigu_word2(raw1, raw2))
I would be really grateful if you could provide any guidance on this matter.

How to find polysemy words from input query?

I am working on polysemy disambiguation project and for that I am trying to find polysemous words from input query. The way I am doing it is:
#! /usr/bin/python
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
stop = stopwords.words('english')
print "enter input query"
string = raw_input()
str1 = [i for i in string.split() if i not in stop]
a = list()
for w in str1:
if(len(wn.synsets(w)) > 1):
a.append(w)
Here list a will contain polysemous words.
But using this method almost all words will be considered as polysemy.
e.g if my input query is "milk is white in colour" then it is storing ('milk','white','colour') as polysemy words
WordNet is known to be very fine grained and it sometimes makes distinctions between very subtly different senses that you and I might think are the same. There have been attempts to make WordNet coarser, google "Automatic of a coarse grained WordNet". I am not sure if the results of that paper are available for download, but you can always contact the authors.
Alternatively, change your working definition of polysemy. If the most frequent sense of a word accounts for more than 80% of its uses in a large corpus, then the word is not polysemous. You will have to obtain frequency counts for the different senses of as many words as possible. Start your research here and here.

Performing Stemming outputs jibberish/concatenated words

I am experimenting with the python library NLTK for Natural Language Processing.
My Problem: I'm trying to perform stemming; reduce words to their normalised form. But its not producing correct words. Am I using the stemming class correctly? And how can I get the results I am attempting to get?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem
words = ["forgot","forgotten","there's","myself","remuneration"]
for word in words:
print stemmer.stem(word)
#output is:
#forgot forgotten there' myself remuner
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.

Categories