Clarification on how word2vec `generate_batch()` works? - python

I have been trying to understand how `generate_batch()` works so that I can apply it to my own test and dataset (I find the TensorFlow code on GitHub too complex and not very straightforward).
I will be using a skip-gram model. This is the code I wrote so far. I'd like a non-cryptic explanation of what's going on and what I need to do to make this work.
def generate_batch(self):
    inputs = []
    labels = []
    for i, phrase in enumerate(self.training_phrases):  # training_phrases look like this: ['I like that cat', '...', ..]
        array_list = utils.skip_gram_tokenize(phrase)  # This transforms a sentence into an array of arrays of numbers representing the sentence, ex. [[181, 152], [152, 165], [165, 208], [208, 41]]
        for array in array_list:
            inputs.append(array)  # I noticed that this is useless, I could just do inputs = array_list
    return inputs, labels
This is where I am right now. From the `generate_batch()` that TensorFlow provides on GitHub, I can see that it returns `inputs, labels`.
I assume that `inputs` is the array of skip-grams, but what is `labels`? How do I generate the labels?
Also, I saw that it implements `batch_size`. How can I do that? (I assume I have to split the data into smaller pieces, but how does that work? Do I put the data into an array?)
Regarding batch_size, what happens if the batch size is 16, but the data offers only 130 inputs? Do I do 8 regular batches and then a minibatch of 2 inputs?

For skip-gram you need to feed input-label pairs consisting of the current word and one of its context words. The context words for each input word are taken from a window around it in the text phrase.
Consider the text phrase "Here's looking at you kid". For a window of 3 and the current word at, you have two context words, looking and you. So the input-label pairs are {at, looking} and {at, you}, which you then convert into their number representations.
In your code, the array-list example is given as [[181, 152], [152, 165], [165, 208], [208, 41]], which means the context of each current word is defined only as the next word and not the previous one.
Now that you have these pairs generated, get them into batches and train on them. It's OK to have an unevenly sized final batch, but make sure that your loss is an average loss and not a sum loss.
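For illustration, here is a minimal sketch of how the pair generation and batching could look (this is not the TensorFlow example's exact code; `word_to_id` is a hypothetical vocabulary lookup standing in for your `utils.skip_gram_tokenize` machinery):

import random

def skip_gram_pairs(phrase, word_to_id, window=1):
    # Pair every word with each of its context words inside the window
    ids = [word_to_id[w] for w in phrase.split()]
    pairs = []
    for i, center in enumerate(ids):
        for j in range(max(0, i - window), min(len(ids), i + window + 1)):
            if j != i:
                pairs.append((center, ids[j]))  # input = center word, label = context word
    return pairs

def generate_batches(phrases, word_to_id, batch_size=16, window=1):
    pairs = [p for phrase in phrases for p in skip_gram_pairs(phrase, word_to_id, window)]
    random.shuffle(pairs)
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]   # the last batch may be smaller
        inputs = [inp for inp, _ in batch]
        labels = [lab for _, lab in batch]
        yield inputs, labels

With 130 pairs and batch_size=16 you would get 8 full batches plus one final batch of 2, which is fine as long as the loss is averaged within each batch.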


Gensim- KeyError: 'word not in vocabulary'

I am trying to achieve something similar to the product-similarity calculation used in this example: how-to-build-recommendation-system-word2vec-python/
I have a dictionary where the key is the item_id and the value is the product associated with it, e.g.: dict_items([('100018', ['GRAVY MIX PEPPER']), ('100025', ['SNACK CHEEZIT WHOLEGRAIN']), ('100040', ['CAULIFLOWER CELLO 6 CT.']), ('100042', ['STRIP FRUIT FLY ELIMINATOR'])....)
The data structure is the same as in the example (as far as I know). However, I am getting KeyError: "word '100018' not in vocabulary" when calling the similarity function on the model using the key present in the dictionary.
# train word2vec model
model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)

model.build_vocab(purchases_train, progress_per=200)

model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
def similar_products(v, n=6):  # similarity function
    # extract most similar products for the input vector
    ms = model.similar_by_vector(v, topn=n + 1)[1:]
    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms
I am calling the function using:
similar_products(model['100018'])
Note: I was able to run the example code with a very similar input data structure, which was also a dictionary. Can someone tell me what I am missing here?
If you get a KeyError telling you a word isn't in your model, then the word genuinely isn't in the model.
If you've trained the model yourself, and expected the word to be in the resulting model, but it isn't, something went wrong with training.
You should look at the corpus (purchases_train in your code) to make sure each item is of the form the model expects: a list of words. You should enable logging during training, and watch the output to confirm the expected amount of word-discovery and training is happening. You can also look at the exact list-of-words known-to-the-model (in model.wv.key_to_index) to make sure it has all the words you expect.
One common gotcha is that, by default, the Word2Vec class uses min_count=5. (Word2vec only works well with multiple varied examples of a word's usage; a word appearing just once, or just a few times, usually won't get a good vector, and may even make surrounding words' vectors worse, so the usual best practice is to discard very rare words.)
Does the (pseudo-)word '100018' appear in your corpus fewer than 5 times? If so, the model will ignore it as a word too rare to get a good vector or to have any positive influence on other word-vectors.
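As a quick diagnostic, a minimal sketch (reusing your purchases_train and model; the min_count=1 line is only an illustration of the knob to change, not a general recommendation):

from collections import Counter

# Count how often each item ID appears across the training "sentences"
freq = Counter(item for basket in purchases_train for item in basket)
print(freq['100018'])                     # fewer than 5 -> dropped by the default min_count=5

# Check whether the item survived vocabulary building
print('100018' in model.wv.key_to_index)  # False means the model never learned a vector for it

# If rare items must be kept, lower min_count when constructing the model, e.g.:
# model = Word2Vec(window=10, sg=1, negative=10, min_count=1, seed=14)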
Separately, the site you're copying example code from may not be a quality source. It has changed a bunch of default values for no good reason, such as setting alpha and min_alpha to peculiar non-standard values with no comment as to why. This is usually a sign that the author copied someone else's odd choices without understanding them.

How to not break differentiability with a model's output?

I have an autoregressive language model in PyTorch that generates text (a collection of sentences) given one input:
output_text = ["sentence_1. sentence_2. sentence_3. sentence_4."]
Note that the output of the language model is in the form of logits (a probability distribution over the vocabulary), which can be converted to token IDs or strings.
Some of these sentences need to go into another model to get a loss that should affect only those sentences:
loss1 = model2("sentence_2")
loss2 = model2("sentence_4")
loss_total = loss1+loss2
What is the correct way to break/split the generated text from the first model without breaking differentiability? That is, the corresponding text (from above) should look like a PyTorch tensor of tensors, so that I can then use some of them in the next model:
"[["sentence_1."]
["sentence_2."]
["sentence_3."]
["sentence_4."]]
For example, Python's split(".") method will most likely break differentiability, but will allow me to take each individual sentence and insert it into the second model to get a loss.
OK, solved it. Posting the answer for completeness.
Since the output is in the form of logits, I can take the argmax to get the index of each token. This lets me know where each period is (i.e., where each sentence ends). I can then split the sentences in the following way to maintain the gradients:
sentences_list = []
r = torch.rand(50) #imagine that this is the output logits (though instead of a tensor of values it will be a tensor of tensors)
period_indices = [10,30,49]
sentences_list.append(r[0:10])
sentences_list.append(r[10:30])
sentences_list.append(r[30:])
Now each element in sentences_list is a sentence that I can send to another model to get a loss.
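As a small sanity check, here is a sketch showing that slicing preserves the autograd graph (the loss here is just a toy stand-in for whatever the second model computes):

import torch

logits = torch.randn(50, 100, requires_grad=True)   # toy stand-in for the generated logits
sentence_2 = logits[10:30]                           # slicing keeps the autograd graph

# Toy stand-in for model2: any differentiable function of the slice
loss = sentence_2.pow(2).mean()
loss.backward()

print(sentence_2.grad_fn is not None)       # True: the slice is connected to the graph
print(logits.grad[10:30].abs().sum() > 0)   # gradients flowed into the sliced region
print(logits.grad[:10].abs().sum() == 0)    # untouched rows received no gradient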

How to get cosine similarity of word embedding from BERT model

I am interested in how to get the similarity of a word's embedding in different sentences from a BERT model (that is, the same word can have different meanings in different contexts).
For example:
sent1 = 'I like living in New York.'
sent2 = 'New York is a prosperous city.'
I want to get the value of cos(New York, New York) between sent1 and sent2: even though the phrase 'New York' is the same, it appears in different sentences. I got some intuition from https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958/2
But I still do not know which layer's embedding I need to extract and how to calculate the cosine similarity for my example above.
Thanks in advance for any suggestions!
Okay let's do this.
First you need to understand that BERT has 13 layers. The first layer is basically just the embedding layer that BERT gets passed during the initial training. You can use it but probably don't want to since that's essentially a static embedding and you're after a dynamic embedding. For simplicity I'm going to only use the last hidden layer of BERT.
Here you're using two words: "New" and "York". You could treat this as one during preprocessing and combine it into "New-York" or something if you really wanted. In this case I'm going to treat it as two separate words and average the embedding that BERT produces.
This can be described in a few steps:
Tokenize the inputs
Determine where the tokenizer has word_ids for New and York (suuuuper important)
Pass through BERT
Average
Cosine similarity
First, what you need to import: `from transformers import AutoTokenizer, AutoModel` (the code below also uses `numpy` and `torch`, so `import numpy as np` and `import torch` as well).
Now we can create our tokenizer and our model:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased', output_hidden_states=True).eval()
Make sure to use the model in evaluation mode unless you're trying to fine tune!
Next we need to tokenize (step 1):
tok1 = tokenizer(sent1, return_tensors='pt')
tok2 = tokenizer(sent2, return_tensors='pt')
Step 2. Determine where the indices of the words match
# This is where the "New" and "York" can be found in sent1
sent1_idxs = [4, 5]
sent2_idxs = [0, 1]
tok1_ids = [np.where(np.array(tok1.word_ids()) == idx) for idx in sent1_idxs]
tok2_ids = [np.where(np.array(tok2.word_ids()) == idx) for idx in sent2_idxs]
The above code checks where the word_ids() produced by the tokenizer overlap the word indices from the original sentence. This is necessary because the tokenizer splits rare words. So if you have something like "aardvark", when you tokenize it and look at it you actually get this:
In [90]: tokenizer.convert_ids_to_tokens( tokenizer('aardvark').input_ids)
Out[90]: ['[CLS]', 'a', '##ard', '##var', '##k', '[SEP]']
In [91]: tokenizer('aardvark').word_ids()
Out[91]: [None, 0, 0, 0, 0, None]
Step 3. Pass through BERT
Now we grab the embeddings that BERT produces across the token ids that we've produced:
with torch.no_grad():
    out1 = model(**tok1)
    out2 = model(**tok2)

# Only grab the last hidden state
states1 = out1.hidden_states[-1].squeeze()
states2 = out2.hidden_states[-1].squeeze()

# Select the tokens that we're after corresponding to "New" and "York"
embs1 = states1[[tup[0][0] for tup in tok1_ids]]
embs2 = states2[[tup[0][0] for tup in tok2_ids]]
Now you will have two embeddings, each of shape (2, 768). The first dimension is 2 because of the two words we're looking at: "New" and "York". The second dimension is the embedding size of BERT.
Step 4. Average
Okay, so this isn't necessarily what you want to do but it's going to depend on how you treat these embeddings. What we have is two (2, 768) shaped embeddings. You can either compare New to New and York to York or you can combine New York into an average. I'll just do that but you can easily do the other one if it works better for your task.
avg1 = embs1.mean(axis=0)
avg2 = embs2.mean(axis=0)
Step 5. Cosine sim
Cosine similarity is pretty easy using torch:
torch.cosine_similarity(avg1.reshape(1,-1), avg2.reshape(1,-1))
# tensor([0.6440])
This is good! They point in the same direction. They're not exactly 1 but that can be improved in several ways.
You can fine tune on a training set
You can experiment with averaging different layers rather than just the last hidden layer like I did
You can try to be creative in combining New and York. I took the average but maybe there's a better way for your exact needs.
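For example, if you would rather compare token by token instead of averaging, here is a small sketch reusing the embs1 and embs2 tensors from above:

# Compare "New" with "New" and "York" with "York" individually
for token, e1, e2 in zip(["New", "York"], embs1, embs2):
    sim = torch.cosine_similarity(e1.unsqueeze(0), e2.unsqueeze(0))
    print(token, sim.item())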

How to add new embeddings for unknown words in Tensorflow (training & pre-set for testing)

I am curious as to how I can add a normal-randomized 300 dimension vector (elements' type = tf.float32) whenever a word unknown to the pre-trained vocabulary is encountered. I am using pre-trained GloVe word embeddings, but in some cases, I realize I encounter unknown words, and I want to create a normal-randomized word vector for this new found unknown word.
The problem is that with my current set up, I use tf.contrib.lookup.index_table_from_tensor to convert from words to integers based on the known vocabulary. This function can create new tokens and hash them for some predefined number of out of vocabulary words, but my embed will not contain an embedding for this new unknown hash value. I am uncertain if I can simply append a randomized embedding to the end of the embed list.
I also would like to do this in an efficient way, so a pre-built TensorFlow function or a method involving TensorFlow functions would probably be best. I define pre-known special tokens, such as an end-of-sentence token and a default unknown as the empty string ("", at index 0), but this is limited in its power to learn for various different unknown words. I currently use tf.nn.embedding_lookup() as the final embedding step.
I would like to be able to add new random 300d vectors for each unknown word in the training data, and I would also like to add pre-made random word vectors for any unknown tokens not seen in training that are possibly encountered during testing. What is the most efficient way of doing this?
def embed_tensor(string_tensor, trainable=True):
    """
    Convert list of strings into list of indices, then into 300d vectors
    """
    # ordered lists of vocab and corresponding (by index) 300d vector
    vocab, embed = load_pretrained_glove()

    # Set up tensorflow look up from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value=0)
    string_tensor = vocab_lookup.lookup(string_tensor)

    # define the word embedding
    embedding_init = tf.Variable(tf.constant(np.asarray(embed),
                                             dtype=tf.float32),
                                 trainable=trainable,
                                 name="embed_init")

    # return the word embedded version of the sentence (300d vectors/word)
    return tf.nn.embedding_lookup(embedding_init, string_tensor)
The code example below adapts your embed_tensor function such that words are embedded as follows:
For words that have a pretrained embedding, the embedding is initialized with the pretrained embedding. The embedding can be kept fixed during training if trainable is False.
For words in the training data that don't have a pretrained embedding, the embedding is initialized randomly. The embedding can be kept fixed during training if trainable is False.
For words in the test data that don't occur in the training data and don't have a pretrained embedding, a single randomly initialized embedding vector is used. This vector can't be trained.
import tensorflow as tf
import numpy as np

EMB_DIM = 300

def load_pretrained_glove():
    return ["a", "cat", "sat", "on", "the", "mat"], np.random.rand(6, EMB_DIM)

def get_train_vocab():
    return ["a", "dog", "sat", "on", "the", "mat"]

def embed_tensor(string_tensor, trainable=True):
    """
    Convert list of strings into list of indices, then into 300d vectors
    """
    # ordered lists of vocab and corresponding (by index) 300d vector
    pretrained_vocab, pretrained_embs = load_pretrained_glove()
    train_vocab = get_train_vocab()
    only_in_train = list(set(train_vocab) - set(pretrained_vocab))
    vocab = pretrained_vocab + only_in_train

    # Set up tensorflow look up from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value=len(vocab))
    string_tensor = vocab_lookup.lookup(string_tensor)

    # define the word embedding
    pretrained_embs = tf.get_variable(
        name="embs_pretrained",
        initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
        shape=pretrained_embs.shape,
        trainable=trainable)
    train_embeddings = tf.get_variable(
        name="embs_only_in_train",
        shape=[len(only_in_train), EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=trainable)
    unk_embedding = tf.get_variable(
        name="unk_embedding",
        shape=[1, EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=False)

    embeddings = tf.concat([pretrained_embs, train_embeddings, unk_embedding], axis=0)

    return tf.nn.embedding_lookup(embeddings, string_tensor)
FYI, to have a sensible, non-random representation for words that don't occur in the training data and don't have a pretrained embedding, you could consider mapping words with a low frequency in your training data to an unk token (that is not in your vocabulary) and make the unk_embedding trainable. This way you learn a prototype for words that are unseen in the training data.
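For example, a minimal preprocessing sketch of that remapping (the min_count threshold and the "<unk>" token name are just illustrative choices):

from collections import Counter

def remap_rare_words(tokenized_sentences, min_count=2, unk_token="<unk>"):
    # Replace words seen fewer than min_count times with a shared unk token
    freq = Counter(w for sent in tokenized_sentences for w in sent)
    return [[w if freq[w] >= min_count else unk_token for w in sent]
            for sent in tokenized_sentences]

# "dog" appears only once, so it becomes "<unk>" and the unk embedding gets training signal
sents = [["a", "cat", "sat"], ["a", "dog", "sat"], ["a", "cat", "sat"]]
print(remap_rare_words(sents))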
I have never tried it, but I can suggest a possible way that uses the same machinery as your code.
The index_table_from_tensor method accepts a num_oov_buckets parameter that hashes all your out-of-vocabulary (OOV) words into a predefined number of buckets.
If you set this parameter to a large enough value, you will see your data spread among these buckets (each bucket has an ID greater than the ID of the last in-vocabulary word).
So, if (at each lookup) you set (i.e. assign) the last rows of your embedding_init variable (those corresponding to the buckets) to a random value, and if you make num_oov_buckets large enough that collisions are minimized, you can obtain a behaviour that is (an approximation of) what you are asking for, in a very efficient way.
The random behaviour can be justified by a theory similar to that of hash tables: if the number of buckets is large enough, the hashing of the strings will assign each OOV word to a different bucket with high probability (i.e. minimizing collisions). Since you are assigning a different random number to each bucket, you obtain an (almost) unique mapping for each OOV word.
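A rough sketch of one way to wire this up (TF 1.x API, reusing load_pretrained_glove() from the answer above; here the random bucket rows are simply appended once to the embedding matrix rather than re-assigned at each lookup, and the bucket count and initializer range are arbitrary choices):

import numpy as np
import tensorflow as tf

NUM_OOV_BUCKETS = 1000   # "large enough" so that hash collisions between OOV words are rare
EMB_DIM = 300

vocab, embed = load_pretrained_glove()                     # as defined in the answer above
string_tensor = tf.placeholder(tf.string, shape=[None])    # your input words

vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(vocab),
    num_oov_buckets=NUM_OOV_BUCKETS)                       # OOV words hash to ids >= len(vocab)
ids = vocab_lookup.lookup(string_tensor)

# Pretrained rows followed by randomly initialized rows, one per OOV bucket
oov_rows = tf.get_variable(
    "oov_embeddings",
    shape=[NUM_OOV_BUCKETS, EMB_DIM],
    initializer=tf.random_uniform_initializer(-0.04, 0.04),
    trainable=False)
embeddings = tf.concat([tf.constant(np.asarray(embed), dtype=tf.float32), oov_rows], axis=0)

embedded = tf.nn.embedding_lookup(embeddings, ids)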
An idea I had for this was to capture the new words in the pre-trained embedding by adding a new dimension for each new word (basically maintaining their one-hot nature).
Assuming the number of new words is small but they're important, you could, for instance, increase the dimensionality of your embedded results from 300 to 300 + the number of new words, where each new word gets all zeros except a 1 in its own dimension.

Working with variable-length text in Tensorflow

I am building a Tensorflow model to perform inference on text phrases.
For sake of simplicity, assume I need a classifier with fixed number of output classes but a variable-length text in input. In other words, my mini batch would be a sequence of phrases but not all phrases have the same length.
data = ['hello',
'my name is Mark',
'What is your name?']
My first preprocessing step was to build a dictionary of all the words and map each word to its integer word ID. The input becomes:
data = [[1],
        [2, 3, 4, 5],
        [6, 4, 7, 3]]
What's the best way to handle this kind of input? Can tf.placeholder() handle variable-size input within the same batch of data?
Or should I pad all strings so that they all have the same length, equal to the length of the longest string, using some placeholder for the missing words? This seems very memory inefficient if some strings are much longer than most of the others.
-- EDIT --
Here is a concrete example.
When I know the size of my datapoints (and all the datapoints have the same length, e.g. 3), I normally use something like:
input = tf.placeholder(tf.int32, shape=(None, 3))
with tf.Session() as sess:
    print(sess.run([...], feed_dict={input: [[1, 2, 3], [1, 2, 3]]}))
where the first dimension of the placeholder is the minibatch size.
What if the input sequences are words in sentences of different length?
feed_dict={input:[[1, 2, 3], [1]]}
The other two answers are correct, but low on details. I was just looking at how to do this myself.
There is machinery in TensorFlow to do all of this (for some parts it may be overkill).
Starting from a string tensor (shape [3]):
import tensorflow as tf
lines = tf.constant([
    'Hello',
    'my name is also Mark',
    'Are there any other Marks here ?'])
vocabulary = ['Hello', 'my', 'name', 'is', 'also', 'Mark', 'Are', 'there', 'any', 'other', 'Marks', 'here', '?']
The first thing to do is split this into words (note the space before the question mark.)
words = tf.string_split(lines," ")
words will now be a sparse tensor of shape [3,7], where the two dimensions of the indices are [line number, position]. This is represented as:
indices    values
 0 0       'Hello'
 1 0       'my'
 1 1       'name'
 1 2       'is'
 ...
Now you can do a word lookup:
table = tf.contrib.lookup.index_table_from_tensor(vocabulary)
word_indices = table.lookup(words)
This returns a sparse tensor with the words replaced by their vocabulary indices.
Now you can read out the sequence lengths by looking at the maximum position on each line:
line_number = word_indices.indices[:, 0]
line_position = word_indices.indices[:, 1]
lengths = tf.segment_max(data=line_position,
                         segment_ids=line_number) + 1
So if you're processing variable-length sequences you probably want to feed them into an LSTM ... so let's use a word embedding for the input (it requires a dense input):
EMBEDDING_DIM = 100
dense_word_indices = tf.sparse_tensor_to_dense(word_indices)
e_layer = tf.contrib.keras.layers.Embedding(len(vocabulary), EMBEDDING_DIM)
embedded = e_layer(dense_word_indices)
Now embedded will have a shape of [3,7,100], [lines, words, embedding_dim].
Then a simple lstm can be built:
LSTM_SIZE = 50
lstm = tf.nn.rnn_cell.BasicLSTMCell(LSTM_SIZE)
And run it across the sequence, handling the padding.
outputs, final_state = tf.nn.dynamic_rnn(
    cell=lstm,
    inputs=embedded,
    sequence_length=lengths,
    dtype=tf.float32)
Now outputs has a shape of [3,7,50], or [line,word,lstm_size]. If you want to grab the state at the last word of each line you can use the (hidden! undocumented!) select_last_activations function:
from tensorflow.contrib.learn.python.learn.estimators.rnn_common import select_last_activations
final_output = select_last_activations(outputs,tf.cast(lengths,tf.int32))
That does all the index shuffling to select the output from the last timestep. This gives a size of [3,50] or [line, lstm_size]
init_t = tf.tables_initializer()
init = tf.global_variables_initializer()
with tf.Session() as sess:
    init_t.run()
    init.run()
    print(final_output.eval().shape)
I haven't worked out the details yet but I think this could probably all be replaced by a single tf.contrib.learn.DynamicRnnEstimator.
How about this? (I didn't implement it, but maybe the idea will work.)
This method is based on BOW representation.
Get your data as tf.string
Split it using tf.string_split
Find indexes of your words using tf.contrib.lookup.string_to_index_table_from_file or tf.contrib.lookup.string_to_index_table_from_tensor. Length of this tensor can vary.
Find embeddings of your indexes.
word_embeddings = tf.get_variable("word_embeddings",
                                  [vocabulary_size, embedding_size])
embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)
Sum up the embeddings, and you will get a tensor of fixed length (= embedding size). Maybe you can choose a method other than sum (average, or something else).
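Put together, a rough sketch of this bag-of-words idea (using tf.nn.embedding_lookup_sparse to do the lookup and the sum in one step; the vocabulary and sizes here are made up):

import tensorflow as tf

vocabulary = ['hello', 'my', 'name', 'is', 'mark', 'what', 'your', '?']
embedding_size = 100

sentences = tf.placeholder(tf.string, shape=[None])        # e.g. ['my name is mark', 'hello']
words = tf.string_split(sentences, " ")                     # SparseTensor of word strings
table = tf.contrib.lookup.index_table_from_tensor(
    tf.constant(vocabulary), default_value=0)
word_ids = table.lookup(words)                              # SparseTensor of word ids

word_embeddings = tf.get_variable("word_embeddings",
                                  [len(vocabulary), embedding_size])

# Lookup + sum in one step: every sentence becomes one fixed-length vector,
# no matter how many words it had
sentence_vectors = tf.nn.embedding_lookup_sparse(
    word_embeddings, word_ids, None, combiner='sum')        # shape [batch, embedding_size]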
Maybe it’s too late :) Good luck.
I was building a sequence-to-sequence translator the other day. What I decided to do was make it for a fixed length of 32 words (which was a bit above the average sentence length), although you can make it as long as you want. I then added a NULL word to the dictionary and padded all my sentence vectors with it. That way I could tell the model where the end of my sequence was, and the model would just output NULL at the end of its output. For instance, take the expression "Hi what is your name?". This would become "Hi what is your name? NULL NULL NULL NULL ... NULL". It worked pretty well, but your loss and accuracy during training will appear a bit higher than they actually are, since the model usually gets the NULLs right, which count towards the cost.
There is another approach called masking. This also allows you to build a model for a fixed-length sequence, but only evaluate the cost up to the end of a shorter sequence. You could search for the first instance of NULL in the output sequence (or expected output, whichever is greater) and only evaluate the cost up to that point. Also, I think some TensorFlow functions like tf.dynamic_rnn support masking, which may be more memory efficient. I am not sure, since I have only tried the first approach of padding.
Finally, I think in the tensorflow example of Seq2Seq model they use buckets for different sized sequences. This would probably solve your memory issue. I think you could share the variables between the different sized models.
So here is what I did (not sure if it's 100% the right way, to be honest):
In your vocab dict, where each key is a number pointing to one particular word, add another key, say K, which points to "<PAD>" (or any other representation you want to use for padding).
Now your placeholder for input would look something like this:
x_batch = tf.placeholder(tf.int32, shape=(batch_size, None))
where None represents the largest phrase/sentence/record in your mini batch.
Another small trick I used was to store the length of each phrase in my mini batch. For example:
If my input was: x_batch = [[1], [1,2,3], [4,5]]
then I store: len_batch = [1, 3, 2]
Later I use this len_batch and the maximum phrase length (l_max) in my minibatch to create a binary mask. Since l_max=3 in the example above, my mask would look like this:
mask = [
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0]
]
Now if you multiply this with your loss, you basically eliminate all the loss introduced as a result of padding.
Hope this helps.
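For example, a small sketch of that masking step (plain TF 1.x; tf.sequence_mask builds the binary mask from len_batch, and per_token_loss is just a stand-in for whatever your model produces):

import tensorflow as tf

batch_size = 3

x_batch = tf.placeholder(tf.int32, shape=(batch_size, None))            # padded word ids
len_batch = tf.placeholder(tf.int32, shape=(batch_size,))               # true lengths, e.g. [1, 3, 2]
per_token_loss = tf.placeholder(tf.float32, shape=(batch_size, None))   # stand-in for your model's per-token loss

# 1 for real tokens, 0 for padding positions
mask = tf.sequence_mask(len_batch, maxlen=tf.shape(x_batch)[1], dtype=tf.float32)

masked_loss = per_token_loss * mask
loss = tf.reduce_sum(masked_loss) / tf.reduce_sum(mask)   # average over real tokens only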
