"KeyError: word not in vocabulary" in word2vec - python

I want to convert some Japanese words to vectors so that I can train a model for prediction. For that I downloaded pretrained models from here.
import gensim
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import KeyedVectors
from gensim import models
from janome.tokenizer import Tokenizer
w2v_model = models.KeyedVectors.load_word2vec_format(w2v_models_path)
t = Tokenizer()
# I am testing for some random string
sentence = "社名公開求人住宅手当・家賃補助制度がある企業在宅勤務・リモートワーク可能な求人テレワークコロナに負けるな!積極採用中の企業特集リモートワーク可能なWebデザイナー求人"
tokens = [x.surface for x in t.tokenize(sentence)]
vectors = [w2v_model[v] for v in tokens]
In the last line, I am getting KeyError: "word 'テレワークコロナ' not in vocabulary"
Is there anything wrong here?

If you get a "not in vocabulary" error, you can trust that the token (word/key) that you've requested isn't in that KeyedVectors model.
You can see the full list of words known to your model (in the order they are stored) with w2v_model.index_to_key. (Or just take a quick peek at a range of 20 from the middle as a sanity check, using Python slice access like w2v_model.index_to_key[500:520].)
Are you sure 'テレワークコロナ' (and any other string giving the same error) is a legitimate, common Japanese word? Might the tokenizer be failing in some way? Are most of the words the tokenizer returns in the model?
It looks like the site you've linked has just copied those sets-of-word-vectors from Facebook's FastText work (https://fasttext.cc/docs/en/crawl-vectors.html). And, you're just using the plain text "word2vec_format" lists of vectors, so you only have the exact words in that file, and not the full FastText model - which also models word-fragments, and can thus 'guess' vectors for unknown words. (These guesses aren't very good – like working out a word's possible meaning from word-roots – but are usually better than nothing.)
I don't know if that approach works well for Japanese, but you could try it: grab the .bin (rather than text) file and load it using Gensim's FastText support, specifically the load_facebook_vectors() function. You'll then get a special kind of KeyedVectors (FastTextKeyedVectors) that will give you such guesses for unknown words, which might help for your purposes (or not).
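To answer the "are most of the tokens in the model?" question, a quick coverage check can help. This is only a minimal sketch: split_by_vocab, toy_vocab and toy_tokens are hypothetical names, and a plain set stands in for the loaded model. It relies only on the `in` operator, which KeyedVectors also supports.

```python
def split_by_vocab(tokens, vocab):
    """Split tokens into (known, unknown) lists, preserving order.

    `vocab` can be anything that supports `in`; a gensim KeyedVectors
    object does, so the loaded model could be passed directly.
    """
    known, unknown = [], []
    for tok in tokens:
        (known if tok in vocab else unknown).append(tok)
    return known, unknown


# Toy stand-in for the model's vocabulary (an assumption for illustration).
toy_vocab = {"リモート", "ワーク", "可能", "な", "求人"}
toy_tokens = ["リモート", "ワーク", "可能", "な", "求人", "テレワークコロナ"]

known, unknown = split_by_vocab(toy_tokens, toy_vocab)
coverage = len(known) / len(toy_tokens)
print(f"coverage: {coverage:.0%}, unknown: {unknown}")
```

With the real model you would pass w2v_model as vocab and then look up vectors only for the tokens in known, which avoids the KeyError.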

Related

Adding domain specific vocabulary to Huggingface pretrained tokenizer

I am using the huggingface transformers package and am loading a pretrained OpenAIGPTTokenizer tokenizer from a vocab file and merges file.
I want to add some domain-specific words to the vocabulary of this pretrained tokenizer (words such as "sms", "idk" and "lol")
When I tokenise a word that is currently in the vocabulary, the resulting token has a </w> symbol appended which I assume means end-of-word.
>>>tokenizer.tokenize("hello there lol")
["hello</w>", "there</w>", "lo", "l</w>"]
However, if I try and add a token to the tokenizer, it doesn't have a </w> symbol at the end. Although, at least the tokenizer recognises that "lol" is its own token.
>>>tokenizer.tokenize("hello there lol")
["hello</w>", "there</w>", "lol"]
I don't think this is the correct way to go about extending the vocabulary. What should I do?
Thanks

words not available in corpus for Word2Vec training

I am totally new to Word2Vec. I want to find cosine similarity between word pairs in my data. My codes are as follows:
import pandas as pd
from gensim.models import Word2Vec
model = Word2Vec(corpus_file="corpus.txt", sg=0, window =7, size=100, min_count=10, iter=4)
vocabulary = list(model.wv.vocab)
data=pd.read_csv("experiment.csv")
cos_similarity = model.wv.similarity(data['word 1'], data['word 2'])
The problem is some words in the data columns of my "experiment.csv" file: "word 1" and "word 2" are not present in the corpus file ("corpus.txt"). So this error is returned:
"word 'setosa' not in vocabulary"
What should I do to handle words that are not present in my input corpus? I want to assign words in my experiment that are not present in the input corpus the vector zero, but I am stuck how to do it.
Any ideas for my problems?
It's really easy to give unknown words the origin (all 'zero') vector:
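import numpy as np
word = data['word 1'][0]  # a single word, not the whole column
if word in model.wv:
    vec = model.wv[word]
else:
    vec = np.zeros(100)  # same dimensionality as size=100 used in training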
But this is unlikely to be what you want. Cosine similarity divides by each vector's length, so it is undefined for the zero vector, which therefore can't be meaningfully compared to other vectors.
It's often better to simply ignore unknown words. If they were so rare that your training data didn't have enough of them to create a vector, they can't contribute much to other analyses.
If they're still important, the best approach is to get more data, with realistic usage contexts, so they get meaningful vectors.
Another alternative is to use an algorithm, such as the word2vec variant FastText, which can always synthesize a guess-vector for any words that were out-of-vocabulary (OOV) based on the training data. It does this by learning word-vectors for word-fragments (character n-grams), then assembling a vector for a new unknown word from those fragments. It's often better than random, because unknown words are often typos or variants of known words with which they share a lot of segments. But it's still not great, and for really odd strings, essentially returns a random vector.
Another tactic I've seen used, but wouldn't personally recommend, is to replace a lot of the words that would otherwise be ignored – such as those with fewer than min_count occurrences – with some single plug token, like say '<OOV>'. Then that synthetic token becomes a quite-common word, but gets an almost entirely meaningless vector: essentially random and low-magnitude. (The prevalence of this fake word & noise-vector in training will tend to make other surrounding words' vectors worse or slower-to-train, compared to simply eliding the low-frequency words.) But then, when dealing with later unknown words, you can use this same '<OOV>' pseudoword's vector as a not-too-harmful stand-in.
But again: it's almost always better to do some combination of – (a) more data; (b) ignoring rare words; (c) using an algorithm like FastText which can synthesize better-than-nothing vectors – than to collapse all unknown words to a single nonsense vector.
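As a rough illustration of the "ignore unknown words" option, here is a minimal sketch that skips pairs with a missing word instead of substituting zeros. The names cosine, safe_similarity and toy_vectors are hypothetical, and a plain dict with made-up 2-d values stands in for model.wv.

```python
import math


def cosine(u, v):
    """Cosine similarity, or None when either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return None  # undefined: the formula would divide by zero
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (norm_u * norm_v)


def safe_similarity(word_a, word_b, vectors):
    """Return the cosine similarity, or None if either word is unknown."""
    if word_a not in vectors or word_b not in vectors:
        return None
    return cosine(vectors[word_a], vectors[word_b])


# Toy 2-d vectors standing in for model.wv (values are made up).
toy_vectors = {"iris": [1.0, 0.0], "flower": [1.0, 1.0], "petal": [0.0, 2.0]}
pairs = [("iris", "flower"), ("iris", "setosa"), ("iris", "petal")]

results = {(a, b): safe_similarity(a, b, toy_vectors) for a, b in pairs}
for pair, score in results.items():
    print(pair, score)
```

Returning None rather than 0.0 for a missing word keeps "no data" distinguishable from "orthogonal vectors", so those rows can be dropped or reported separately later.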

Word2Vec vocab results in just letters and symbols

I'm new to Word2Vec and I am trying to cluster words based on their similarity. To start I am using nltk to separate the sentences and then using the resulting list of sentences as the input into Word2Vec. However, when I print the vocab, it is just a bunch of letters, numbers and symbols rather than words. To be specific, an example of one of the letters is "< gensim.models.keyedvectors.Vocab object at 0x00000238145AB438>, 'L':"
# imports needed and logging
import gensim
from gensim.models import word2vec
import logging
import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
with open('C:\\Users\\Freddy\\Desktop\\Thesis\\Descriptions.txt','r') as f_open:
    text = f_open.read()
arr = []
sentences = nltk.sent_tokenize(text) # this gives a list of sentences
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)
model = word2vec.Word2Vec(sentences, size = 300)
print(model.wv.vocab)
As the tutorial and the documentation for the Word2Vec class suggest, the constructor requires a list of lists of words as its first parameter (or, more generally, an iterable of iterables of words):
sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora,...
I believe that before feeding the sentences into Word2Vec you need to apply word_tokenize to each of them, changing the crucial line to:
sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text)]
TL;DR
You get letters as your "words" because Word2Vec treats strings corresponding to sentences as iterables containing words. Iterating over strings results in the sequence of letters. These letters are used as the basis for the model learning (instead of intended words).
As the ancient saying goes: trash in - trash out.
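The effect is easy to reproduce without Word2Vec at all. In this minimal sketch, count_vocab is a hypothetical helper that iterates over each sentence the same way, and str.split() is only a stand-in for nltk.word_tokenize:

```python
from collections import Counter


def count_vocab(sentences):
    """Count the items that iterating over each sentence yields."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)  # iterating a str yields single characters
    return counts


raw_sentences = ["Large dogs bark.", "Small dogs bark too."]

# Wrong: each sentence is still a string, so the "words" are characters.
char_vocab = count_vocab(raw_sentences)

# Right: each sentence is a list of tokens. str.split() is only a stand-in
# here for nltk.word_tokenize, which also separates punctuation.
tokenized = [s.split() for s in raw_sentences]
word_vocab = count_vocab(tokenized)

print(sorted(char_vocab)[:5])
print(sorted(word_vocab))
```

The first vocabulary contains only single characters such as 'L', which matches what the question's model.wv.vocab shows.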

Statistical sentence suggestion model like spell checking

There are already spell checking models available which suggest correct spellings based on a corpus of correctly spelled words. Can the granularity be increased from the letter to the word, so that we get phrase suggestions? That is, if an incorrect phrase is entered, the model should suggest the nearest correct phrase from a corpus of correct phrases, having been trained on a list of valid phrases.
Are there any python libraries which achieve this functionality already or how to proceed for this for an existing large gold standard phrase corpus to get statistically relevant suggestions?
Note: this is different from a spell checker as the alphabets in a spell checker are finite whereas in a phrase correcter the alphabet is itself a word hence theoretically infinite , but we can limit the number of words from a phrase bank.
What you want to build is an N-gram model, which consists of computing the probability of each word following a given sequence of n-1 preceding words.
You can use NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
You can consider 2-gram (Markov model):
What is the probability for "kitten" to follow "cute"?
...or 3-gram:
What is the probability for "kitten" to follow "the cute"?
etc.
Obviously training the model with n+1-gram is costlier than n-gram.
Instead of considering words, you can consider the couple (word, pos) where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens))
You can also try to consider the lemmas instead of the words.
Here are some interesting lectures about N-gram modelling:
Introduction to N-grams
Estimating N-gram Probabilities
This is a simple and short example of code (2-gram) not optimized:
from collections import defaultdict
import nltk
import math
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    # Build a list: in Python 3, map() returns an iterator that cannot be sliced.
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
# Convert the raw counts to log10 probabilities.
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}
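To connect this back to phrase suggestion, here is a rough sketch of how bigram counts could rank candidate phrases. train_bigrams, phrase_logprob and best_phrase are hypothetical names, str.split() stands in for a real tokenizer, and the fixed floor penalty for unseen bigrams is a crude assumption rather than proper smoothing.

```python
import math
from collections import defaultdict


def train_bigrams(sentences):
    """Count how often each token follows another (sentences are token lists)."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for token, next_token in zip(tokens, tokens[1:]):
            counts[token][next_token] += 1
    return counts


def phrase_logprob(tokens, counts, floor=-10.0):
    """Sum of log10 P(next | token); unseen bigrams get the `floor` penalty."""
    total = 0.0
    for token, next_token in zip(tokens, tokens[1:]):
        followers = counts.get(token, {})
        seen = followers.get(next_token, 0)
        if seen:
            total += math.log10(seen / sum(followers.values()))
        else:
            total += floor  # assumption: flat penalty instead of smoothing
    return total


def best_phrase(candidates, counts):
    """Pick the candidate phrase the bigram model finds most likely."""
    return max(candidates, key=lambda phrase: phrase_logprob(phrase.split(), counts))


corpus = ["the cat is cute", "he jumps and he is happy", "the cat is happy"]
model = train_bigrams([s.split() for s in corpus])
suggestion = best_phrase(["the cat is happy", "the is cat happy"], model)
print(suggestion)
```

In practice the candidates would come from your phrase bank (for example, phrases within a small edit distance of the input), and the model would only be used to rank them.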

Performing Stemming outputs jibberish/concatenated words

I am experimenting with the python library NLTK for Natural Language Processing.
My Problem: I'm trying to perform stemming, i.e. reduce words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I am attempting to get?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
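from nltk import stem
# Assumed: the question does not show which stemmer was created; the output below is consistent with Porter.
stemmer = stem.PorterStemmer()
words = ["forgot","forgotten","there's","myself","remuneration"]
for word in words:
    print(stemmer.stem(word))
#output is:
#forgot forgotten there' myself remuner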
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
