Most related term to a given sentence, nltk word2vec - python

Having a trained word2vec model, is there a way to check which word in its vocabulary is the most "related" to a whole sentence?
I was looking for something similar to
model.wv.most_similar("the dog is on the table")
which could result in ["dog","table"]

The most_similar() method can take multiple words as input, ideally as the named parameter positive. (That's as in, "positive examples", to be contrasted with "negative examples" which can also be provided via the negative parameter, and are useful when asking most_similar() to solve analogy-problems.)
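For instance, the classic analogy query combines both parameters (standard gensim usage, shown here purely as an illustration):
# 'man' is to 'king' as 'woman' is to ... ?
model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)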
When it receives multiple words, it returns results that are closest to the average of all words provided. That might be somewhat related to a whole sentence, but such an average-of-all-word-vectors is a fairly weak way of summarizing a sentence.
The multiple words should be provided as a list of strings, not a raw string of space-delimited words. So, for example:
sims = model.wv.most_similar(positive=['the', 'dog', 'is', 'on', 'the', 'table'])
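If you start from a raw sentence, a minimal sketch is to split it into tokens and keep only those present in the model's vocabulary before calling most_similar(). The preprocessing here is an assumption; it should mirror whatever tokenization was used in training:
sentence = "the dog is on the table"
# keep only tokens the model actually knows; frequent words like 'the' will
# still be included and will pull the average toward generic results
tokens = [w for w in sentence.lower().split() if w in model.wv]
sims = model.wv.most_similar(positive=tokens)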

Related

Using BERT to extract most similar words instead of word2vec for labeling functions

I am fairly new to BERT, and I wanted to test both approaches, word2vec and BERT, for extracting the most similar words to a given word, to pattern-match in my labeling functions.
I am currently using Snorkel; one labeling function looks like this:
@labeling_function()
def lf_find_good_synonyms(x):
    good_synonyms = word_vectors.most_similar("good", topn=25)
    good_list = syn_list(good_synonyms)
    return POSITIVE if any(word in x.stemmed for word in good_list) else ABSTAIN
This function basically looks for the word "good" or any of its similar words in a sentence (the sentences are stemmed, and so are the words, since the function syn_list returns the stem of each similar word); if found, the function simply labels the sentence as POSITIVE.
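For reference, a minimal sketch of what the syn_list helper is assumed to do here (the helper's name comes from the question; its body is an assumption based on that description):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def syn_list(similar_words):
    # most_similar() returns (word, similarity) tuples; keep only the stemmed words
    return [stemmer.stem(word) for word, _score in similar_words]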
The issue here is that my word vectors are based on word2vec, which is an older approach. Could I use BERT instead, and would it improve performance much, given that labeling functions are allowed to be lousy?

words not available in corpus for Word2Vec training

I am totally new to Word2Vec. I want to find the cosine similarity between word pairs in my data. My code is as follows:
import pandas as pd
from gensim.models import Word2Vec
model = Word2Vec(corpus_file="corpus.txt", sg=0, window=7, size=100, min_count=10, iter=4)
vocabulary = list(model.wv.vocab)
data=pd.read_csv("experiment.csv")
cos_similarity = model.wv.similarity(data['word 1'], data['word 2'])
The problem is that some words in the "word 1" and "word 2" columns of my "experiment.csv" file are not present in the corpus file ("corpus.txt"), so this error is returned:
"word 'setosa' not in vocabulary"
What should I do to handle words that are not present in my input corpus? I want to assign the zero vector to the words in my experiment that are missing from the input corpus, but I am stuck on how to do it.
Any ideas?
It's really easy to give unknown words the origin (all 'zero') vector:
import numpy as np

word = data['word 1']
if word in model.wv:
    vec = model.wv[word]
else:
    vec = np.zeros(100)
But this is unlikely to be what you want: the zero vector can't be compared to other vectors via cosine similarity.
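To see why, here is a quick sketch of the usual cosine-similarity formula applied to a zero vector (plain numpy, not gensim):
import numpy as np

def cosine(a, b):
    # undefined when either vector has zero norm: the denominator is zero
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosine(np.zeros(100), np.ones(100))  # nan, plus a runtime warning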
It's often better to simply ignore unknown words. If they were so rare that your training data didn't have enough of them to create a vector, they can't contribute much to other analyses.
If they're still important, the best approach is to get more data, with realistic usage contexts, so they get meaningful vectors.
Another alternative is to use an algorithm such as FastText, a word2vec variant that can always synthesize a guess-vector for out-of-vocabulary (OOV) words, based on what it learned during training. It does this by learning vectors for word fragments (character n-grams), then assembling a vector for a new unknown word from those fragments. This is often better than random, because unknown words are often typos or variants of known words with which they share many fragments. But it's still not great, and for really odd strings it essentially returns a random vector.
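For illustration, a minimal FastText sketch (using gensim 3.x parameter names to match the question's code; in gensim 4.x, size becomes vector_size):
from gensim.models import FastText

# trained the same way as the Word2Vec model in the question
ft_model = FastText(corpus_file="corpus.txt", size=100, window=7, min_count=10)

print('setosa' in ft_model.wv.vocab)  # likely False if 'setosa' was rare or absent
vec = ft_model.wv['setosa']           # still returns a vector assembled from char n-grams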
Another tactic I've seen used, but wouldn't personally recommend, is to replace many of the words that would otherwise be ignored (such as those with fewer than min_count occurrences) with a single plug token, say '<OOV>'. That synthetic token then becomes a quite-common word, but it gets an almost entirely meaningless vector: a random, low-magnitude one. (The prevalence of this fake word and its noise-vector in training will tend to make other surrounding words' vectors worse or slower to train, compared to simply eliding the low-frequency words.) But then, when dealing with later unknown words, you can use this same '<OOV>' pseudoword's vector as a not-too-harmful stand-in.
But again: it's almost always better to do some combination of (a) getting more data, (b) ignoring rare words, and (c) using an algorithm like FastText that can synthesize better-than-nothing vectors, than to collapse all unknown words onto a single nonsense vector.

gensim function predict output words

I use the gensim library to create a word2vec model. It contains the function predict_output_word(), which I understand as follows:
For example, I have a model that is trained with the sentence: "Anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy."
and then I use
model.predict_output_word(context_words_list=['Anarchism', 'does', 'not', 'offer', 'a', 'fixed', 'body', 'of', 'from', 'a', 'single', 'particular', 'world', 'view', 'instead', 'fluxing'], topn=10).
In this situation, could I get/predict the correct word or the omitted word 'doctrine'?
Is this the right way? Please explain this function in detail.
I am wondering if you have seen the documentation of predict_output_word?
Report the probability distribution of the center word given the context words as input to the trained model.
To answer your specific question about the word 'doctrine': it strongly depends on whether 'doctrine' is among the 10 most probable words for the context words you listed. That in turn means it must occur relatively frequently in the corpus you use to train the model. Since 'doctrine' does not seem to be a very frequently used word, there is a good chance other words will have a higher probability of appearing in that context. So if you rely only on the returned probabilities of words given the context, you may end up failing to predict 'doctrine' in this case.
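For concreteness, a minimal usage sketch (assuming the model from the question; note that predict_output_word() only works for models trained with negative sampling, and that the context tokens must match the training vocabulary exactly, including case):
context = ['Anarchism', 'does', 'not', 'offer', 'a', 'fixed', 'body', 'of',
           'from', 'a', 'single', 'particular', 'world', 'view', 'instead', 'fluxing']
# returns a list of (word, probability) pairs for the topn most probable center words
for word, prob in model.predict_output_word(context, topn=10):
    print(word, prob)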

python nltk keyword extraction from sentence

"First thing we do, let's kill all the lawyers." - William Shakespeare
Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:
[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the overall "meaning"* of a sentence.
*note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags)
I don't think there's any perfect answer to this question, because there is no gold-standard set of input/output mappings that everybody will agree upon. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you are able to very precisely and completely unambiguously describe exactly what you want your system to do, your problem will be more than half solved.
Until then, I can suggest some additional heuristics to help you get what you want.
Build an idf dictionary using your data, i.e. build a mapping from every word to a number that correlates with how rare that word is. Bonus points for doing it for larger n-grams as well.
By combining the idf values of each word in your input sentence with their POS tags, you can answer questions of the form 'What is the rarest verb in this sentence?', 'What is the rarest noun in this sentence?', etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so maybe trying to find the rarest noun and rarest verb in a sentence and returning just those two will do the trick for most of your intended use cases. If not, you can always make your algorithm a little more complicated and see if that seems to do the job better.
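As a minimal sketch of that rarest-noun/rarest-verb heuristic (the helper names and the idf construction here are illustrative assumptions, not part of the answer):
import math
from collections import Counter

def build_idf(tokenized_docs):
    # map each lowercased word to log(N / document frequency): higher means rarer
    n_docs = len(tokenized_docs)
    df = Counter(w for doc in tokenized_docs for w in {word.lower() for word in doc})
    return {word: math.log(n_docs / count) for word, count in df.items()}

def rarest_by_tag(pos_list, idf, tag_prefix):
    # among tokens whose POS tag starts with tag_prefix ('NN' or 'VB'),
    # return the one with the highest idf; unseen words default to 0 here
    candidates = [tok for tok, tag in pos_list if tag.startswith(tag_prefix)]
    return max(candidates, key=lambda tok: idf.get(tok.lower(), 0.0), default=None)

pos_list = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
            ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
# keywords = (rarest_by_tag(pos_list, idf, "VB"), rarest_by_tag(pos_list, idf, "NN"))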
Ways to expand this include trying to identify larger phrases using n-gram idfs, building a full parse tree of the sentence (using, say, the Stanford parser) and identifying patterns within those trees that indicate where the important parts tend to sit, etc.
One simple approach would be to keep stop word lists for NN, VB etc. These would be high frequency words that usually don't add much semantic content to a sentence.
The snippet below shows distinct lists for each type of word token, but you could just as well employ a single stop word list for both verbs and nouns (such as a general-purpose English stop word list).
stop_words = dict(
    NNP=['first', 'second'],
    NN=['thing'],
    VBP=['do', 'done'],
    VB=[],
    NNS=['lets', 'things'],
)

def filter_stop_words(pos_list):
    return [[token, token_type]
            for token, token_type in pos_list
            if token.lower() not in stop_words[token_type]]
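Applied to the POS list from the question, this filter keeps just the two target words (assuming every tag in the input has an entry in stop_words):
pos_list = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
            ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
print(filter_stop_words(pos_list))
# [['kill', 'VB'], ['lawyers', 'NNS']]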
In your case, you can simply use the RAKE package (thanks to Fabian) for Python to get what you need:
>>> import RAKE
>>> path = ...  # your path to a stop word list file
>>> r = RAKE.Rake(path)
>>> r.run("First thing we do, let's kill all the lawyers")
[('lawyers', 1.0), ('kill', 1.0), ('thing', 1.0)]
The path should point to a stop word list file.
But in general, you are better off using the NLTK package for NLP tasks.

Performing stemming outputs gibberish/concatenated words

I am experimenting with the python library NLTK for Natural Language Processing.
My problem: I'm trying to perform stemming, i.e. reduce words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I am attempting to get?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem

# the reported output matches NLTK's Porter stemmer
stemmer = stem.PorterStemmer()

words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
for word in words:
    print(stemmer.stem(word))

# output is:
# forgot forgotten there' myself remuner
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
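A minimal sketch of that WordNet route (it assumes the WordNet data has been downloaded via nltk.download('wordnet'); note the lemma of 'forgotten' is 'forget', not 'forgot' as in the desired output above):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# verbs need pos="v"; the default part of speech is noun
print(lemmatizer.lemmatize("forgotten", pos="v"))  # forget
print(lemmatizer.lemmatize("remuneration"))        # remuneration (already a lemma)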
