Gensim - KeyError: 'word not in vocabulary' - Python

I am trying to calculate product similarity, similar to what is done in this example: how-to-build-recommendation-system-word2vec-python/
I have a dictionary where the key is the item_id and the value is the product associated with it. For example: dict_items([('100018', ['GRAVY MIX PEPPER']), ('100025', ['SNACK CHEEZIT WHOLEGRAIN']), ('100040', ['CAULIFLOWER CELLO 6 CT.']), ('100042', ['STRIP FRUIT FLY ELIMINATOR'])....)
The data structure is the same as in the example (as far as I know). However, I am getting KeyError: "word '100018' not in vocabulary" when calling the similarity function on the model using the key present in the dictionary.
# train word2vec model
model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)

model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
def similar_products(v, n=6):  # similarity function
    # extract most similar products for the input vector
    ms = model.similar_by_vector(v, topn=n+1)[1:]

    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)

    return new_ms
I am calling the function using:
similar_products(model['100018'])
Note: I was able to run the example code with a very similar data structure as input, which was also a dictionary. Can someone tell me what I am missing here?

If you get a KeyError telling you a word isn't in your model, then the word genuinely isn't in the model.
If you've trained the model yourself, and expected the word to be in the resulting model, but it isn't, something went wrong with training.
You should look at the corpus (purchases_train in your code) to make sure each item is of the form the model expects: a list of words. You should enable logging during training, and watch the output to confirm the expected amount of word-discovery and training is happening. You can also look at the exact list-of-words known-to-the-model (in model.wv.key_to_index) to make sure it has all the words you expect.
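For example, a minimal sketch of the logging and vocabulary checks (assuming gensim 4.x, where the vocabulary is exposed as model.wv.key_to_index):

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# after build_vocab()/train(), inspect exactly which words made it into the model
print(len(model.wv.key_to_index))
print('100018' in model.wv.key_to_index)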
One common gotcha is that, for the best operation of the word2vec algorithm, the Word2Vec class uses a default of min_count=5. (Word2vec only works well with multiple varied examples of a word's usage; a word appearing just once, or just a few times, usually won't get a good vector, and further, might make other surrounding words' vectors worse. So the usual best practice is to discard very-rare words.)
Does the (pseudo-)word '100018' appear in your corpus fewer than 5 times? If so, the model will ignore it as a word too rare to get a good vector, or to have any positive influence on other word-vectors.
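A quick way to check, assuming purchases_train is a list of purchase sequences (lists of item-id tokens):

from collections import Counter

token_counts = Counter(token for purchase in purchases_train for token in purchase)
print(token_counts['100018'])   # if this is below 5, the default min_count=5 discards the item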
Separately, the site you're taking example code from may not be a quality source. It has changed a bunch of default values for no good reason, such as setting the alpha and min_alpha values to peculiar non-standard numbers with no comment as to why. This is usually a sign that someone who doesn't know what they're doing is copying the odd choices of someone else who didn't know what they were doing either.

Related

How to not break differentiability with a model's output?

I have an autoregressive language model in Pytorch that generates text, which is a collection of sentences, given one input:
output_text = ["sentence_1. sentence_2. sentence_3. sentence_4."]
Note that the output of the language model is in the form of logits (a probability distribution over the vocabulary), which can be converted to token IDs or strings.
Some of these sentences need to go into another model to get a loss that should affect only those sentences:
loss1 = model2("sentence_2")
loss2 = model2("sentence_4")
loss_total = loss1+loss2
What is the correct way to break/split the generated text from the first model without breaking differentiability? That is, so the corresponding text (from above) will look like a pytorch tensor of tensors (in order to then use some of them in the next model):
"[["sentence_1."]
["sentence_2."]
["sentence_3."]
["sentence_4."]]
For example, Python's split(".") method will most likely break differentiability, but will allow me to take each individual sentence and insert it into the second model to get a loss.
Okay, solved it. Posting the answer for completeness.
Since the output is in the form of logits, I can take the argmax to get the indices of each token. This should allow me to know where each period is (to know where the end of the sentence is). I can then split the sentences in the following way to maintain the gradients:
import torch

sentences_list = []
r = torch.rand(50)  # imagine that this is the output logits (though instead of a tensor of values it will be a tensor of tensors)
period_indices = [10, 30, 49]
sentences_list.append(r[0:10])
sentences_list.append(r[10:30])
sentences_list.append(r[30:])
Now each element in sentences_list is a sentence that I can send to another model to get a loss.
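As a hedged sketch of how this might look with real logits, assuming logits has shape (seq_len, vocab_size) with requires_grad=True and that a Hugging Face-style tokenizer is available (both names are placeholders, not from the question):

import torch

period_id = tokenizer.convert_tokens_to_ids('.')   # id of the period token (assumed to be a single token)
token_ids = logits.argmax(dim=-1)                  # (seq_len,) hard token predictions
period_positions = (token_ids == period_id).nonzero(as_tuple=True)[0].tolist()

# slice the logits tensor itself, so each sentence chunk keeps its gradient history
sentences_list, start = [], 0
for end in period_positions:
    sentences_list.append(logits[start:end + 1])
    start = end + 1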

How to get cosine similarity of word embedding from BERT model

I am interested in how to get the similarity of word embeddings in different sentences from a BERT model (that is, the same word can have different meanings in different contexts).
For example:
sent1 = 'I like living in New York.'
sent2 = 'New York is a prosperous city.'
I want to get the value of cos(New York, New York) between sent1 and sent2: even though the phrase 'New York' is the same, it appears in different sentences. I got some intuition from https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958/2
But I still do not know which layer's embedding I need to extract and how to calculate the cosine similarity for my example above.
Thanks in advance for any suggestions!
Okay let's do this.
First you need to understand that BERT produces 13 layers of hidden states. The first is basically just the embedding layer that the tokens pass through before any attention is applied. You can use it, but you probably don't want to, since it's essentially a static embedding and you're after a dynamic, contextual one. For simplicity I'm going to use only the last hidden layer of BERT.
Here you're using two words: "New" and "York". You could treat this as one during preprocessing and combine it into "New-York" or something if you really wanted. In this case I'm going to treat it as two separate words and average the embedding that BERT produces.
This can be described in a few steps:
Tokenize the inputs
Determine where the tokenizer has word_ids for New and York (suuuuper important)
Pass through BERT
Average
Cosine similarity
First, what you need to import: from transformers import AutoTokenizer, AutoModel (along with import numpy as np and import torch, which the steps below use).
Now we can create our tokenizer and our model:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased', output_hidden_states=True).eval()
Make sure to use the model in evaluation mode unless you're trying to fine tune!
Next we need to tokenize (step 1):
tok1 = tokenizer(sent1, return_tensors='pt')
tok2 = tokenizer(sent2, return_tensors='pt')
Step 2. Determine where the indices of the words match:
# This is where the "New" and "York" can be found in sent1
sent1_idxs = [4, 5]
sent2_idxs = [0, 1]
tok1_ids = [np.where(np.array(tok1.word_ids()) == idx) for idx in sent1_idxs]
tok2_ids = [np.where(np.array(tok2.word_ids()) == idx) for idx in sent2_idxs]
The above code checks where the word_ids() produced by the tokenizer overlap the word indices from the original sentence. This is necessary because the tokenizer splits rare words. So if you have something like "aardvark", when you tokenize it and look at it you actually get this:
In [90]: tokenizer.convert_ids_to_tokens( tokenizer('aardvark').input_ids)
Out[90]: ['[CLS]', 'a', '##ard', '##var', '##k', '[SEP]']
In [91]: tokenizer('aardvark').word_ids()
Out[91]: [None, 0, 0, 0, 0, None]
Step 3. Pass through BERT
Now we grab the embeddings that BERT produces across the token ids that we've produced:
with torch.no_grad():
    out1 = model(**tok1)
    out2 = model(**tok2)

# Only grab the last hidden state
states1 = out1.hidden_states[-1].squeeze()
states2 = out2.hidden_states[-1].squeeze()

# Select the tokens that we're after corresponding to "New" and "York"
embs1 = states1[[tup[0][0] for tup in tok1_ids]]
embs2 = states2[[tup[0][0] for tup in tok2_ids]]
Now you will have two embeddings, each of shape (2, 768). The first dimension is 2 because we're looking at two words: "New" and "York". The second is BERT's embedding size (768).
Step 4. Average
Okay, so this isn't necessarily what you want to do but it's going to depend on how you treat these embeddings. What we have is two (2, 768) shaped embeddings. You can either compare New to New and York to York or you can combine New York into an average. I'll just do that but you can easily do the other one if it works better for your task.
avg1 = embs1.mean(axis=0)
avg2 = embs2.mean(axis=0)
Step 5. Cosine sim
Cosine similarity is pretty easy using torch:
torch.cosine_similarity(avg1.reshape(1,-1), avg2.reshape(1,-1))
# tensor([0.6440])
This is good! They point in the same direction. They're not exactly 1 but that can be improved in several ways.
You can fine tune on a training set
You can experiment with averaging different layers rather than just the last hidden layer like I did
You can try to be creative in combining New and York. I took the average but maybe there's a better way for your exact needs.
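As an aside on the word-by-word option mentioned in step 4, here is a small sketch that reuses the embs1/embs2 tensors from step 3 and compares "New" with "New" and "York" with "York" instead of averaging:

# row 0 is "New", row 1 is "York" in both embeddings
sim_new  = torch.cosine_similarity(embs1[0].unsqueeze(0), embs2[0].unsqueeze(0))
sim_york = torch.cosine_similarity(embs1[1].unsqueeze(0), embs2[1].unsqueeze(0))
print(sim_new, sim_york)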

How to build a content-based recommender system that uses multiple attributes?

I want to build a content-based recommender system in Python that uses multiple attributes to decide whether two items are similar. In my case, the "items" are packages hosted by the C# package manager (example) that have various attributes such as name, description, tags that could help to identify similar packages.
I have a prototype recommender system here that currently uses only a single attribute, the description, to decide whether packages are similar. It computes TF-IDF rankings for the descriptions and prints out the top 10 recommendations based on that:
# Code mostly stolen from http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html
def train(dataframe):
    tfidf = TfidfVectorizer(analyzer='word',
                            ngram_range=(1, 3),
                            min_df=0,
                            stop_words='english')
    tfidf_matrix = tfidf.fit_transform(dataframe['description'])
    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

    for idx, row in dataframe.iterrows():
        similar_indices = cosine_similarities[idx].argsort()[:-10:-1]
        similar_items = [(dataframe['id'][i], cosine_similarities[idx][i])
                         for i in similar_indices]
        id = row['id']
        similar_items = [it for it in similar_items if it[0] != id]
        # This 'sum' turns a list of tuples into a single tuple:
        # [(1,2), (3,4)] -> (1,2,3,4)
        flattened = sum(similar_items, ())
        try_print("Top 10 recommendations for %s: %s" % (id, flattened))
How can I combine cosine_similarities with other similarity measures (based on same author, similar names, shared tags, etc.) to give more context to my recommendations?
For some context, my work with content-based recommenders has revolved primarily around raw text and categorical data/features. Here's a high-level approach I've taken that has worked out nicely and is pretty simple to implement.
Suppose I have three feature columns that I can potentially use to make recommendations: description, name, and tags. To me, the path of least resistance entails combining these three feature sets in a useful way.
You're off to a good start, using TF-IDF to encode description. So why not treat name and tags in a similar way by creating a feature "corpus" consisting of description, name, and tags? Literally, this would mean concatenating the contents of each of the three columns into one long text column.
Be wise about the concatenation, though: it's probably to your advantage to preserve which column a given word comes from, in the case of features like name and tags, which are assumed to have much lower cardinality than description. To put it more explicitly: instead of just creating your corpus column like this:
df['corpus'] = (pd.Series(df[['description', 'name', 'tags']]
                          .fillna('')
                          .values.tolist())
                .str.join(' '))
You might try preserving information about where particular data points in name and tags come from. Something like this:
df['name_feature'] = ['name_{}'.format(x) for x in df['name']]
df['tags_feature'] = ['tags_{}'.format(x) for x in df['tags']]
And after you do that, I would take things a step further by considering how the default tokenizer (which you're using above) works in TfidfVectorizer. Suppose you have the name of a given package's author: "Johnny 'Lightning' Thundersmith". If you just concatenate that literal string, the tokenizer will split it up and roll each of "Johnny", "Lightning", and "Thundersmith" into separate features, which could potentially diminish the information added by that row's value for name. I think it's best to try to preserve that information. So I would do something like this to each of your lower-cardinality text columns (e.g. name or tags):
import string

def raw_text_to_feature(s, sep=' ', join_sep='x', to_include=string.ascii_lowercase):
    def filter_word(word):
        return ''.join([c for c in word if c in to_include])
    return join_sep.join([filter_word(word) for word in s.split(sep)])

df['name_feature'] = df['name'].apply(raw_text_to_feature)
The same sort of critical thinking should be applied to tags. If you've got a comma-separated "list" of tags, you'll probably have to parse those individually and figure out the right way to use them.
Ultimately, once you've got all of your <x>_feature columns created, then you can create your final "corpus" and plug that into your recommender system as inputs.
This whole system takes some engineering, to be sure, but I've found it's the easiest way to introduce new information from other columns that have different cardinalities.
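For concreteness, here is a rough sketch of that final step, assuming the df['name_feature'] and df['tags_feature'] columns built above (the column names are illustrative, not from your code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# assemble the engineered columns into a single corpus and feed it to the same TF-IDF pipeline
df['corpus'] = (df['description'].fillna('') + ' '
                + df['name_feature'].fillna('') + ' '
                + df['tags_feature'].fillna(''))

tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['corpus'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)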
As I understand your question, there are two ways this can be done:
Combine the other features with tfidf_matrix and then calculate the cosine similarity
Calculate the similarity of other features using other methods and then somehow combine them with the cosine similarity of tfidf_matrix to get a meaningful metric.
I was talking about the first one.
For example, let's say that for your data the tfidf_matrix (built from only the 'description' column) has shape (3000, 4000),
where 3000 is the number of rows in the data and 4000 is the number of unique words (the vocabulary) found by the TfidfVectorizer.
Now let's say you do some feature processing on the other columns ('authors', 'id', etc.) and that produces 5 columns, so the shape of that data is (3000, 5).
I was suggesting you combine the two matrices (concatenate the columns) so that the new shape of your data is (3000, 4005), and then calculate the cosine similarity.
See below example:
from scipy import sparse
# This is your original matrix
tfidf_matrix = tfidf.fit_transform(dataframe['description'])
# This is the other features
other_matrix = some_processing_on_other_columns()
combined_matrix = sparse.hstack((tfidf_matrix, other_matrix))
cosine_similarities = linear_kernel(combined_matrix, combined_matrix)
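To make some_processing_on_other_columns() a bit more concrete (that function name is just the placeholder from the snippet above), one simple option is to one-hot encode a low-cardinality column such as 'authors':

from sklearn.preprocessing import OneHotEncoder

# produces a sparse (n_rows, n_distinct_authors) matrix that can be hstack-ed with tfidf_matrix
encoder = OneHotEncoder(handle_unknown='ignore')
other_matrix = encoder.fit_transform(dataframe[['authors']])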
You have a vector for a user $\gamma_u$ and an item $\gamma_i$. The scoring function for your recommendation is their dot product, $\gamma_u \cdot \gamma_i$: the higher the score, the stronger the recommendation.
Right now you said your feature vector has only one element, but once you add more features, this model will scale to accommodate them.
In this case you have already engineered your vectors by hand, but typically in recommenders the features are learned through matrix factorization. That approach is called a latent factor model, whereas what you have is a hand-crafted model.
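As a toy numeric illustration of that dot-product score (the numbers are made up):

import numpy as np

gamma_u = np.array([0.3, 1.2, -0.5])   # hypothetical user factors
gamma_i = np.array([0.8, 0.1,  0.4])   # hypothetical item factors
score = float(gamma_u @ gamma_i)       # 0.24 + 0.12 - 0.20 = 0.16; higher means a stronger recommendation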

Gensim Doc2Vec model only generates a limited number of vectors

I am using gensim Doc2Vec model to generate my feature vectors. Here is the code I am using (I have explained what my problem is in the code):
cores = multiprocessing.cpu_count()

# creating a list of tagged documents
training_docs = []

# all_docs: a list of 53 strings which are my documents and are very long (not just a couple of sentences)
for index, doc in enumerate(all_docs):
    # 'doc' is in unicode format and I have already preprocessed it
    training_docs.append(TaggedDocument(doc.split(), str(index+1)))

# at this point, I have 53 strings in my 'training_docs' list
model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores)

# now that I print the vectors, I only have 10 vectors while I should have 53 vectors
# for the 53 documents that I have in my training_docs list.
print(len(model.docvecs))
# output: 10
I am just wondering if I am doing a mistake or if there is any other parameter that I should set?
UPDATE: I was playing with the tags parameter in TaggedDocument, and when I changed it to a mixture of text and numbers like Doc1, Doc2, ..., I got a different count of generated vectors, but still not the expected number of feature vectors.
Look at the actual tags it has discovered in your corpus:
print(model.docvecs.offset2doctag)
Do you see a pattern?
The tags property of each document should be a list of tags, not a single tag. If you supply a simple string-of-an-integer, it will see it as a list-of-digits, and thus only learn the tags '0', '1', ..., '9'.
You could replace str(index+1) with [str(index+1)] and get the behavior you were expecting.
But, since your document IDs are just ascending integers, you can also just use plain Python ints as your doctags. This will save some memory by avoiding the creation of a lookup dict from string-tag to array-slot (int). To do this, replace the str(index+1) with [index]. (This starts the doc-IDs from 0, which is a teensy bit more Pythonic, and also avoids wasting an unused 0 position in the raw array that holds the trained vectors.)
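A minimal sketch of the corrected construction, keeping the question's older-gensim parameter names and using plain int doctags:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

training_docs = [TaggedDocument(doc.split(), [index])   # tags is a *list*; here one int per document
                 for index, doc in enumerate(all_docs)]
model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores)
print(len(model.docvecs))   # should now print 53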

Matching words and vectors in gensim Word2Vec model

I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.
As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. I.e. if I have the vector of embeddings [x, y, z], I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab and the word vectors through model.syn0. But I could not find a location where these are explicitly matched.
This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!
Problem:
Match words to embedding vectors created by Word2Vec() -- how do I do it?
My approach:
After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix outputted as model.syn0.
Thus
for i in range(0, newmod.syn0.shape[0]):  # iterate over all words in model
    print i
    word = [k for k in newmod.vocab if newmod.vocab[k].__dict__['index'] == i]  # get the word out of the internal dictionary by its index
    wordvector = newmod.syn0[i]  # get the vector with the corresponding index
    print wordvector == newmod[word]  # testing: compare result of looking up the word in the model -- this prints True
Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?
Does this even get me correct results?
*My code to create the word vectors:
model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
model.build_vocab(sentencefeeder(folderlist)) #sentencefeeder puts out sentences as lists of strings
model.save("newmodel")
I found this question which is similar but has not really been answered.
I have been searching for a long time to find the mapping between the syn0 matrix and the vocabulary... here is the answer: use model.index2word, which is simply the list of words in the right order!
This is not in the official documentation (why?), but it can be found directly in the source code: https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441
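In other words (older-gensim API, where model.index2word and model.syn0 line up row for row):

# the i-th row of syn0 is the vector for the i-th word in index2word
word_to_vector = {word: model.syn0[i] for i, word in enumerate(model.index2word)}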
If all you want to do is map a word to a vector, you can simply use the [] operator, e.g. model["hello"] will give you the vector corresponding to hello.
If you need to recover a word from a vector you could loop through your list of vectors and check for a match, as you propose. However, this is inefficient and not pythonic. A convenient solution is to use the similar_by_vector method of the word2vec model, like this:
import gensim

documents = [['human', 'interface', 'computer'],
             ['survey', 'user', 'computer', 'system', 'response', 'time'],
             ['eps', 'user', 'interface', 'system'],
             ['system', 'human', 'system', 'eps'],
             ['user', 'response', 'time'],
             ['trees'],
             ['graph', 'trees'],
             ['graph', 'minors', 'trees'],
             ['graph', 'minors', 'survey']]

model = gensim.models.Word2Vec(documents, min_count=1)
print model.similar_by_vector(model["survey"], topn=1)
which outputs:
[('survey', 1.0000001192092896)]
where the number represents the similarity.
However, this method is still inefficient, as it still has to scan all of the word vectors to search for the most similar one. The best solution to your problem is to find a way to keep track of your vectors during the clustering process so you don't have to rely on expensive reverse mappings.
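For example, a sketch of that "keep track" idea with scikit-learn's KMeans (the clustering algorithm and the cluster count are my choices, not from the question):

from sklearn.cluster import KMeans

# cluster the vectors in their original row order, so each label lines up with index2word
vectors = model.wv.syn0
labels = KMeans(n_clusters=10, random_state=0).fit_predict(vectors)

words_by_cluster = {}
for word, label in zip(model.wv.index2word, labels):
    words_by_cluster.setdefault(label, []).append(word)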
So I found an easy way to do this, where nmodel is the name of your model.
import numpy as np

# zip the two lists containing vectors and words (wrap in list() so it can be iterated more than once)
zipped = list(zip(nmodel.wv.index2word, nmodel.wv.syn0))

# the resulting list contains (word, wordvector) tuples. We can extract the entry for any
# `word` or `vector` (replace with the word/vector you're looking for) using a list comprehension:
wordresult = [i for i in zipped if i[0] == word]
vecresult = [i for i in zipped if np.array_equal(i[1], vector)]
This is based on the gensim code. For older versions of gensim, you might need to drop the wv after the model.
As @bpachev mentioned, gensim does have an option of searching by vector, namely similar_by_vector.
It, however, implements a brute-force linear search: it computes the cosine similarity between the given vector and the vectors of all words in the vocabulary, and returns the top neighbours. An alternative option, as mentioned in the other answer, is to use an approximate nearest neighbour search algorithm such as FLANN.
Sharing a gist demonstrating the same:
https://gist.github.com/kampta/139f710ca91ed5fabaf9e6616d2c762b
