Word2Vec empty word not in vocabulary - python

I'm currently required to work on a multilingual text classification model where I have to classify whether two sentences in two languages are semantically similar. I'm also required to use Word2Vec for word embedding.
I am able to generate the word embeddings using Word2Vec; however, when I try to convert my sentences to vectors with a method similar to this, I get an error saying:
KeyError: "word '' not in vocabulary"
Here is my code snippet
import re
import nltk
from gensim.models import Word2Vec

nltk.download('punkt')

# Tokenize each concatenated sentence pair for training
tokenized_text_data = [nltk.word_tokenize(sub) for sub in concatenated_text]
model = Word2Vec(sentences=tokenized_text_data, min_count=1)

# Error happens here
train_vectors = [model.wv[re.split(" |;", row)] for row in concatenated_text]
For context, concatenated_text contains the sentences from the two languages concatenated together with a semicolon as the delimiter, hence the re.split(" |;") call.
I guess the important thing now is to understand why the error is telling me that an empty string '' is not in the vocabulary.
I did not provide the sentences because the dataset is too big, and I can't seem to find which word of which sentence is producing this error.

It turns out the problem was the delimiter I concatenated with all along. There are other semicolons inside the sentences themselves, and because of how re.split(" |;") works, a sentence such as ice cream ; bread ; milk is split into ['ice', 'cream', '', '', 'bread', '', '', 'milk'], with an empty string between every pair of adjacent delimiters. Hence the error word '' not in vocabulary.
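One way to avoid the empty tokens (a minimal sketch, reusing the same concatenated_text and model as above) is to split on runs of spaces/semicolons and drop anything empty before looking the tokens up:

import re

# Split on one-or-more delimiters so " ; " no longer yields '' tokens
train_vectors = [
    model.wv[[tok for tok in re.split(r"[ ;]+", row) if tok]]
    for row in concatenated_text
]

Note that the model was trained on nltk.word_tokenize output, so a token produced by re.split that word_tokenize would have split differently (e.g. punctuation stuck to a word) can still raise the same KeyError.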
I hope this would benefit someone in the future!

Related

KeyError: "word not in vocabulary" in word2vec

I wanted to convert some Japanese words to vectors so that I can train a model for prediction. For that I downloaded pretrained models from here.
import gensim
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import KeyedVectors
from gensim import models
from janome.tokenizer import Tokenizer

# w2v_models_path is the path to the downloaded pretrained vectors
w2v_model = models.KeyedVectors.load_word2vec_format(w2v_models_path)
t = Tokenizer()

# I am testing with some random string
sentence = "社名公開求人住宅手当・家賃補助制度がある企業在宅勤務・リモートワーク可能な求人テレワークコロナに負けるな!積極採用中の企業特集リモートワーク可能なWebデザイナー求人"

# Tokenize with Janome and look each surface form up in the model
tokens = [x.surface for x in t.tokenize(sentence)]
vectors = [w2v_model[v] for v in tokens]
In the last line, I am getting KeyError: "word 'テレワークコロナ' not in vocabulary"
Is there anything wrong here?
If you get a "not in vocabulary" error, you can trust that the token (word/key) that you've requested isn't in that KeyedVectors model.
You can see the full list of words known to your model (in the order they are stored) with w2v_model.index_to_key. (Or just take a quick peek at some range of 20 in the middle as a sanity check, with Python ranged access like w2v_model.index_to_key[500:520].)
Are you sure 'テレワークコロナ' (and any other string giving the same error) is a legitimate, common Japanese word? Might the tokenizer be failing in some way? Are most of the words the tokenizer returns in the model?
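A quick way to answer that last question (just a sketch, reusing the tokens and w2v_model from the question) is to check what fraction of the tokenizer's output the pretrained vectors actually cover:

# Count how many Janome tokens are known to the pretrained vectors
known = [tok for tok in tokens if tok in w2v_model.key_to_index]
missing = [tok for tok in tokens if tok not in w2v_model.key_to_index]
print(f"coverage: {len(known)}/{len(tokens)}")
print("missing tokens:", missing)

If coverage is low, the problem lies in the tokenization (or the vocabulary of the downloaded file), not in your lookup code.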
It looks like the site you've linked has just copied those sets-of-word-vectors from Facebook's FastText work (https://fasttext.cc/docs/en/crawl-vectors.html). And, you're just using the plain text "word2vec_format" lists of vectors, so you only have the exact words in that file, and not the full FastText model - which also models word-fragments, and can thus 'guess' vectors for unknown words. (These guesses aren't very good – like working out a word's possible meaning from word-roots – but are usually better than nothing.)
I don't know how well that approach works for Japanese, but you could try it: grab the .bin (rather than text) file instead, and load it with Gensim's FastText support, specifically the load_facebook_vectors() method. You'll then get a special kind of KeyedVectors (FastTextKeyedVectors) that will give you such guesses for unknown words, which might help for your purposes (or not).
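A minimal sketch of that route (the filename here is an assumption; use whatever .bin file you actually downloaded from the FastText site):

from gensim.models.fasttext import load_facebook_vectors

# Decompressed .bin from https://fasttext.cc/docs/en/crawl-vectors.html, e.g. cc.ja.300.bin
ft_vectors = load_facebook_vectors("cc.ja.300.bin")

# FastTextKeyedVectors can synthesize a vector even for out-of-vocabulary tokens
vec = ft_vectors["テレワークコロナ"]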

Creating a parallel corpus from list of words and list of sentences (Python)

I'm trying to create a parallel corpus for supervised machine learning.
Essentially I want to have two files: one with one full sentence per line, and the other with only the specific, manually extracted terms that correspond to the sentence on the same line.
I have already created the file with one sentence per line; now I would like to generate the labels file with the terms on each line. For illustration, this is the code I came up with:
import re

list_of_terms = ["cake", "cola", "water", "stop"]
sentences = ["Let's eat some cake.", "I'd like to have some cola to go with the cake.", "stop eating all this cake, you waterstopper", "I will never eat this again", "cake and cola and water"]

para = []
for line in sentences:
    s = re.findall(r"(?=\b("+'|'.join(list_of_terms)+r")\b)", line)
    para.append(s)

print(*para, sep="\n")
This results in the output I want:
['cake']
['cola', 'cake']
['stop', 'cake']
[]
['cake', 'cola', 'water']
Unfortunately the code does not work very well for the corpora I'm dealing with. In fact, I'm faced with three different kinds of exceptions.
For one corpus, the re.findall function always outputs an additional '' with each term.
[('criminal', ''), ('liability', ''), ('legal', ''), ('fiscal', ''), ('criminal', ''), ('law', '')]
I solved this thanks to the last comment in this thread: Use of findall and parenthesis in Python
[x if x != '' else y for x, y in re.findall(r"(?=\b(" + '|'.join(list_of_terms) + r")\b)", line)]
However, this method throws a ValueError, as the regex is not creating the '' for two other corpora I'm working with. For those I simply use a try/except block and run the sample code with satisfactory results. But why is the regex not creating the '' in this case?
Finally, one other corpus raises re.error: nothing to repeat at position 4950, and I have found no fix for this yet. I suspect there are special characters in the list_of_terms; is there any way to filter those beforehand?
Needless to say, I'm still quite new to coding as my background is translation and not computer science. So a graceful answer would be much appreciated! :)
P.S.: The corpora I am using are all in the ACTER Corpus-Collection: https://github.com/AylaRT/ACTER
You need to re.escape each of the items in the list_of_terms list, and use unambiguous word boundaries:
re.findall(r"(?=(?<!\w)("+'|'.join(map(re.escape, list_of_terms))+r")(?!\w))", line)
The (?<!\w) negative lookbehind matches a location that is not immediately preceded with a word char (digit, letter or _).
The (?!\w) negative lookahead matches a location that is not immediately followed with a word char.
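Putting that together with the loop from the question (a sketch using the sample list_of_terms and sentences defined above):

import re

# re.escape neutralises regex metacharacters such as '(', '+' or '?' inside the terms,
# which is what triggers the "nothing to repeat" re.error
pattern = r"(?=(?<!\w)(" + "|".join(map(re.escape, list_of_terms)) + r")(?!\w))"

para = [re.findall(pattern, line) for line in sentences]
print(*para, sep="\n")

Because the pattern contains only one capturing group, re.findall() returns plain strings here, so no '' entries (and no tuple unpacking) are needed.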

Most related term to a given sentence, nltk word2vec

Having a trained word2vec model, is there a way to check which word in its vocabulary is the most "related" to a whole sentence?
I was looking for something similar to
model.wv.most_similar("the dog is on the table")
which could result in ["dog","table"]
The most_similar() method can take multiple words as input, ideally as the named parameter positive. (That's as in, "positive examples", to be contrasted with "negative examples" which can also be provided via the negative parameter, and are useful when asking most_similar() to solve analogy-problems.)
When it receives multiple words, it returns results that are closest to the average of all words provided. That might be somewhat related to a whole sentence, but such an average-of-all-word-vectors is a fairly weak way of summarizing a sentence.
The multiple words should be provided as a list of strings, not a raw string of space-delimited words. So, for example:
sims = model.wv.most_similar(positive=['the', 'dog', 'is', 'on', 'the', 'table'])
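If you are starting from a raw sentence, you will need to tokenize it yourself, and since most_similar() raises a KeyError for unknown words it can help to keep only in-vocabulary tokens first (a sketch, assuming a trained model as above):

sentence = "the dog is on the table"

# Keep only tokens the model actually knows, to avoid KeyError on rare/unseen words
tokens = [tok for tok in sentence.split() if tok in model.wv.key_to_index]

sims = model.wv.most_similar(positive=tokens)
print(sims[:5])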

Cross-Lingual Word Sense Disambiguation

I am a beginner in computer programming and I am completing an essay on Parallel Corpora in Word Sense Disambiguation.
Basically, I intend to show that substituting a sense for a word translation simplifies the process of identifying the meaning of ambiguous words. I have already word-aligned my parallel corpus (EUROPARL English-Spanish) with GIZA++, but I don't know what to do with the output files. My intention is to build a classifier to calculate the probability of a translation word given the contextual features of the tokens which surround the ambiguous word in the source text.
So, my question is: how do you extract instances of an ambiguous word from a parallel corpus WITH its aligned translation?
I have tried various Python scripts, but these run on the assumption that 1) the English and Spanish texts are in separate corpora and 2) the English and Spanish sentences share the same indexes, which obviously does not work.
e.g.
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# w_lemma assumed to be an NLTK lemmatizer, e.g. WordNetLemmatizer()
w_lemma = WordNetLemmatizer()

def ambigu_word2(document, document2):
    words = ['letter']
    for sentences in document:
        tokens = word_tokenize(sentences)
        for item in tokens:
            x = w_lemma.lemmatize(item)
            for w in words:
                if w == x in sentences:
                    # Print the source sentence and the sentence at the same index
                    # in the second (translated) document
                    print(sentences, document2[document.index(sentences)])

print(ambigu_word2(raw1, raw2))
I would be really grateful if you could provide any guidance on this matter.

Performing stemming outputs gibberish/concatenated words

I am experimenting with the Python library NLTK for Natural Language Processing.
My problem: I'm trying to perform stemming, i.e. reduce words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I'm after?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem

# stemmer assumed to be e.g. the Porter stemmer; the output below matches it
stemmer = stem.PorterStemmer()

words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
for word in words:
    print(stemmer.stem(word))

# output is:
# forgot forgotten there' myself remuner
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
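A minimal sketch with NLTK's WordNet lemmatizer (note it treats every word as a noun unless you pass a part-of-speech tag, which matters for verbs like "forgotten"):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["forgot", "forgotten", "there's", "myself", "remuneration"]

for word in words:
    # pos='v' asks for the verb lemma; without it every word is treated as a noun
    print(lemmatizer.lemmatize(word, pos='v'))

The results still won't match the exact list in the question (the verb lemma of "forgotten" is "forget", not "forgot"), but unlike the stemmer's output they are real words.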
