I trained and tested a CNN for sentiment analysis. The train and test data were prepared the same way, by tokenizing the sentences and mapping each word to a unique integer:
tokenizer = Tokenizer(filters='$%&()*/:;<=>#[\\]^`{|}~\t\n')
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1
sequences = tokenizer.texts_to_sequences(text)
Then I used a pre-trained GloVe model to create the embedding matrix for the CNN:
filepath_glove = 'glove.twitter.27B.200d.txt'
glove_vocab = []
glove_embd=[]
embedding_dict = {}
import numpy as np

with open(filepath_glove, 'r', encoding='UTF-8') as file:
    for line in file:
        row = line.strip().split(' ')
        vocab_word = row[0]
        glove_vocab.append(vocab_word)
        embed_vector = [float(i) for i in row[1:]]  # convert to list of floats
        embedding_dict[vocab_word] = embed_vector

embedding_matrix = np.zeros((vocab_size, 200))
for word, index in tokenizer.word_index.items():
    if word in embedding_dict:  # skip words with no GloVe vector
        embedding_matrix[index] = embedding_dict[word]
At this point I also used the test sentences to create this matrix, which was later passed as weights into the embedding layer:
e= Embedding(vocab_size, 200, input_length=maxSeqLength, weights=[embedding_matrix], trainable=False)(inp)
Now I want to reload my model and test it on some new data, but that means the embedding matrix won't include some words from the new data. This makes me wonder whether I should have included the test data when creating the embedding matrix in the first place. And if not, how does the embedding layer work for those new words? This part is similar to the following question, but I couldn't find an answer there:
How does the Keras Embedding Layer work if word is not found?
Thanks
It's quite simple. You provide vocab_size, which is the number of words the embedding layer knows about. If you pass an index that is out of bounds of vocab_size (a new word), it will either be ignored or Keras will throw an error.
This answers your question about whether you should include all of your data when building the embedding matrix: yes, you should.
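If you want unseen words in new data to map to a known index instead of being dropped, one option is to reserve an out-of-vocabulary token when fitting the tokenizer. A minimal sketch, assuming the same Keras Tokenizer setup as in the question (train_text and new_text are hypothetical names):
from tensorflow.keras.preprocessing.text import Tokenizer

# oov_token reserves index 1 for any word not seen during fit_on_texts;
# its row in embedding_matrix can simply be left as zeros.
tokenizer = Tokenizer(oov_token='<unk>',
                      filters='$%&()*/:;<=>#[\\]^`{|}~\t\n')
tokenizer.fit_on_texts(train_text)
vocab_size = len(tokenizer.word_index) + 1

# At test time, words the tokenizer has never seen become the '<unk>'
# index instead of silently disappearing from the sequence.
new_sequences = tokenizer.texts_to_sequences(new_text)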
I'm trying to calculate the embedding of a sentence using BERT. After I input the sentence into BERT, I calculate the Mean-pooling, which is used as the embedding of the sentence.
Problem
My code can calculate the embedding of sentences, but the computational cost is very high. I don't know what's wrong and I hope someone can help me.
Install BERT
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
Get Embedding Function
# get the word embedding from BERT
def get_word_embedding(text: str):
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # batch size 1
    outputs = model(input_ids)
    last_hidden_states = outputs[1]
    # outputs[1] is the pooled [CLS] output of the model (shape (1, 768)),
    # not the per-token last hidden states
    return last_hidden_states[0]
Data
The maximum number of words in a text is 50. I calculate the entity + text embedding.
Run code
entity_desc is my data.
It's this step that overloads my computer every time I run it.
Please help me!!!
I was using an 80 GB RAM machine in Colab.
entity_embedding = {}
for i in range(len(entity_desc)):
    entity = entity_desc['entity'][i]
    text = entity_desc['text'][i]
    entity += ' ' + text
    entity_embedding[entity_desc['entity_id'][i]] = get_word_embedding(entity)
I fixed the problem.
The reason for the memory overload was that I wasn't saving the tensor to the GPU, so I made the following changes to the code.
model = model.to(device)

import torch

# get the word embedding from BERT
def get_word_embedding(text: str):
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # batch size 1
    input_ids = input_ids.to(device)
    outputs = model(input_ids)
    last_hidden_states = outputs[1]
    last_hidden_states = last_hidden_states.to(device)
    # outputs[1] is the pooled [CLS] output of the model
    return last_hidden_states[0].detach().to(device)
You might be storing the sentence embeddings on the GPU. Try moving them to the CPU before returning them.
# get the word embedding from BERT
def get_word_embedding(text: str):
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # batch size 1
    outputs = model(input_ids)
    last_hidden_states = outputs[1]
    # outputs[1] is the pooled [CLS] output of the model
    return last_hidden_states[0].detach().cpu()
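For reference, if the goal is true mean pooling over the token vectors (rather than the pooled [CLS] output that outputs[1] returns), a minimal sketch that also avoids keeping the autograd graph in memory could look like this; it assumes the same bert-base-uncased tokenizer and model as in the question:
import torch

def get_sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():                          # no gradient graph is stored
        outputs = model(**inputs)
    last_hidden = outputs[0]                       # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    summed = (last_hidden * mask).sum(dim=1)       # sum over real (non-pad) tokens
    counts = mask.sum(dim=1).clamp(min=1)
    return (summed / counts).squeeze(0).cpu()      # (768,) on the CPU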
I have a movie review data set which has two columns Review(Sentences) and Sentiment(1 or 0).
I want to create a classification model using word2vec for the embedding and a CNN for the classification.
I've looked for tutorials on YouTube, but all they do is create vectors for every word and show the most similar words, like this:
model = gensim.models.Word2Vec(cleaned_dataset, min_count=2, size=100, window=5)
words = model.wv.vocab
similar = model.wv.most_similar("bad")
I already have my dependent variable (y), which is the 'Sentiment' column; all I need is the independent variable (X) that I can pass on to my CNN model.
Before using word2vec I used the Bag of Words (BOW) model, which generated a sparse matrix that served as my independent variable (X). How can I achieve something similar using word2vec?
Kindly correct me if I'm doing something wrong.
To get the word vector, you have to do this:
model.wv['word_that_you_want']
You may also want to handle the KeyError that could arise if you don't find that given word in your model. You also might want to read about what an embedding layer is, which is usually used as the first layer of the neural network (for NLP generally) and is basically a lookup mapping of a word to its corresponding word vector.
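A hedged sketch of that lookup with the KeyError handled (model is the gensim Word2Vec model from the question; the 100-dimensional fallback matches its size=100):
import numpy as np

def word_vector(model, word, size=100):
    """Return the vector for word, or a zero vector if it is out of vocabulary."""
    try:
        return model.wv[word]
    except KeyError:
        return np.zeros(size)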
To get the word vectors for an entire sentence, you need to first initialize a numpy array of zeros to the dimensions you want.
You might need other variables such as the length of the longest sentence so that you can pad all sentences to that length. The documentation of the pad_sequences method for Keras is here.
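For instance (a sketch; sequences stands for integer-encoded sentences from a Keras tokenizer, which is an assumption on top of the question's code):
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad/truncate every encoded sentence to the length of the longest one.
X_indices = pad_sequences(sequences, maxlen=length_long_sentence, padding='post')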
A simple example of getting a sentence of word vectors is:
import numpy as np
embedding_matrix = np.zeros((vocab_len, size_of_your_word_vector))
Then iterate over the index of embedding_matrix and add to it, if you find a word vector in your model.
I use this resource which has a lot of examples and I have referenced some of the code there (which I have also used myself sometimes):
embedding_matrix = np.zeros((vocab_length, 100))
for word, index in word_tokenizer.word_index.items():
    if word in model.wv:  # skip words missing from the w2v model (avoids the KeyError)
        embedding_matrix[index] = model.wv[word]
And in your model (I'm assuming Tensorflow with Keras)
embedding_layer = Embedding(vocab_length, 100, weights=[embedding_matrix], input_length=length_long_sentence, trainable=False)
I hope this helps.
Word2Vec doesn't inherently create vectors for a text (set of words) – just individual words.
But, sometimes a not-so-bad vector for a multi-word text is the average of all its word-vectors.
If list_of_words is a list of the words in your text, and all the words are in the Word2Vec model, a simple way to get the average of those words' vectors is:
avg_vector_of_words = model.wv[list_of_words].mean(axis=0)
(If some words aren't present, you'd need to filter them before attempting this to avoid KeyErrors. If you wanted to leave out some words, or use unit-normed word-vectors, or unit-normalize the final vector, you'd need more code.)
Then avg_vector_of_words is a small, dense/'embedded' feature vector for the list_of_words text.
You could pass these vectors, one per text, to another downstream classifier, like your CNN, exactly analogously to how you were previously using sparse BOW vectors.
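A sketch of building that X matrix for every review, with missing words filtered out (assuming cleaned_dataset is a list of token lists, as in the question, and the 100-dimensional model from above):
import numpy as np

def text_vector(w2v_model, words, size=100):
    """Average the vectors of the in-vocabulary words; zeros if none are known."""
    known = [w for w in words if w in w2v_model.wv]
    if not known:
        return np.zeros(size)
    return w2v_model.wv[known].mean(axis=0)

# One dense 100-dimensional row per review, ready for a downstream classifier.
X = np.vstack([text_vector(model, words) for words in cleaned_dataset])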
I'm trying to grasp the idea of a Hierarchical Attention Network (HAN). Most of the code I find online is more or less similar to the one here: https://medium.com/jatana/report-on-text-classification-using-cnn-rnn-han-f0e887214d5f:
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH, trainable=True)
sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32', name='input1')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
sentEncoder = Model(sentence_input, l_lstm)
review_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH), dtype='int32', name='input2')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(LSTM(100))(review_encoder)
preds = Dense(len(macronum), activation='softmax')(l_lstm_sent)
model = Model(review_input, preds)
My question is: What do the input layers here represent? I'm guessing that input1 represents the sentences wrapped with the embedding layer, but in that case what is input2? Is it the output of the sentEncoder? In that case it should be a float, or if it's another layer of embedded words, then it should be wrapped with an embedding layer as well.
The HAN model processes the text in a hierarchy: it takes a document already split into sentences (that's why the shape of input2 is (MAX_SENTS, MAX_SENT_LENGTH)); then it processes each sentence independently using the sentEncoder model (that's why the shape of input1 is (MAX_SENT_LENGTH,)); and finally it processes all the encoded sentences together.
So in your code the whole model is stored in model and its input layer is input2, which you would feed with documents that have been split into sentences and whose words have been integer-encoded (to make them compatible with the embedding layer). The other input layer belongs to the sentEncoder model, which is used inside the model (and not directly by you):
review_encoder = TimeDistributed(sentEncoder)(review_input)
Masoud's answer is correct but I'll rewrite it here in my own words:
The data (X_train) is fed as indexes to the model and is received by input2.
X_train is then forwarded to the encoder model and is received by input1.
input1 is wrapped by an embedding layer, so the indexes are converted to vectors.
So input2 is more a proxy of the model's input.
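Put differently, the array you pass to model.fit has one row per document, MAX_SENTS sentence slots, and MAX_SENT_LENGTH word indices per sentence. A sketch of the expected shape, reusing the names from the code above (num_docs is hypothetical):
import numpy as np

num_docs = 1000  # hypothetical number of documents
# Integer-encoded, padded documents: this is what input2 receives.
X_train = np.zeros((num_docs, MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
# Each X_train[d, s] is one sentence: a row of word indices that
# TimeDistributed(sentEncoder) hands to input1 internally.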
I am trying to implement the type of character level embeddings described in this paper in Keras. The character embeddings are calculated using a bidirectional LSTM.
To recreate this, I've first created a matrix containing, for each word, the indexes of the characters making up the word:
char2ind = {char: index for index, char in enumerate(chars)}
max_word_len = max([len(word) for sentence in sentences for word in sentence])

X_char = []
for sentence in X:
    for word in sentence:
        word_chars = []
        for character in word:
            word_chars.append(char2ind[character])
        X_char.append(word_chars)

X_char = sequence.pad_sequences(X_char, maxlen=max_word_len)
I then define a BiLSTM model with an embedding layer for the word-character matrix. I assume the input_dimension will have to be equal to the number of characters. I want a size of 64 for my character embeddings, so I set the hidden size of the BiLSTM to 32:
char_lstm = Sequential()
char_lstm.add(Embedding(len(char2ind) + 1, 64))
char_lstm.add(Bidirectional(LSTM(hidden_size, return_sequences=True)))
And this is where I get confused. How can I retrieve the embeddings from the model? I'm guessing I would have to compile the model and fit it, then retrieve the weights to get the embeddings, but what parameters should I use to fit it?
Additional details:
This is for an NER task, so the dataset could technically be anything in the word-label format, although I am specifically working with the WikiGold CoNLL corpus available here: https://github.com/pritishuplavikar/Resume-NER/blob/master/wikigold.conll.txt
The expected output from the network are the labels (I-MISC, O, I-PER...)
I expect the dataset to be large enough to train character embeddings directly from it. All words are coded as the indexes of their constituent characters; the alphabet size is roughly 200 characters. The words are padded/cut to 20 characters. There are around 30,000 different words in the dataset.
I hope to be able to learn an embedding for each character based on the information from the different words. Then, as in the paper, I would concatenate the character embeddings with the word's GloVe embedding before feeding it into a Bi-LSTM network with a final CRF layer.
I would also like to be able to save the embeddings so I can reuse them for other similar NLP tasks.
Generally speaking, the Keras approach to building models (even seemingly complex ones) is dead simple. For example, the kind of model you want to build would simply look like this (note this is for a binary classification problem):
model = Sequential()
model.add(Embedding(max_features, out_dims, input_length=maxlen))
model.add(Bidirectional(LSTM(32)))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
This is no different from a plain vanilla NN except for having the Embedding and Bidirectional layers in place of Dense layers. This is one of the things that makes Keras amazing.
Usually it's helpful to look for a working example (Keras has loads) that does more or less the same thing you are trying to do. In this case you could first look at this model and then "reverse engineer" its workings to answer your questions. Usually things come down to formatting the data in the right way, and a working example works wonders here because you can carefully investigate the data format it's using.
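To the specific question of getting the embeddings back out: the Embedding layer's weight matrix is its only parameter, so once char_lstm has been compiled and fitted you can read it off directly. A sketch (the save path and the use of predict to get the per-word BiLSTM outputs are assumptions, not part of the original code):
import numpy as np
from keras.models import Model

# Per-character vectors: the Embedding layer's weights,
# shape (len(char2ind) + 1, 64); row i belongs to character index i.
char_vectors = char_lstm.layers[0].get_weights()[0]
np.save('char_vectors.npy', char_vectors)

# Per-word, character-level representations (the BiLSTM outputs):
# predict with a model that stops at the Bidirectional layer.
char_encoder = Model(char_lstm.input, char_lstm.layers[1].output)
word_char_embeddings = char_encoder.predict(X_char)  # (num_words, max_word_len, 64)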
I was trying to inject pretrained word2vec vectors into existing tensorflow seq2seq model.
Following this answer, I produced the following code. But it doesn't seem to improve performance as it should, although the values in the variable are updated.
In my understanding the error might be due to the fact that EmbeddingWrapper or embedding_attention_decoder create embeddings independently of the vocabulary order?
What would be the best way to load pretrained vectors into tensorflow model?
SOURCE_EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
TARGET_EMBEDDING_KEY = "embedding_attention_seq2seq/embedding_attention_decoder/embedding"
def inject_pretrained_word2vec(session, word2vec_path, input_size, dict_dir, source_vocab_size, target_vocab_size):
    word2vec_model = word2vec.load(word2vec_path, encoding="latin-1")
    print("w2v model created!")
    session.run(tf.initialize_all_variables())
    assign_w2v_pretrained_vectors(session, word2vec_model, SOURCE_EMBEDDING_KEY, source_vocab_path, source_vocab_size)
    assign_w2v_pretrained_vectors(session, word2vec_model, TARGET_EMBEDDING_KEY, target_vocab_path, target_vocab_size)

def assign_w2v_pretrained_vectors(session, word2vec_model, embedding_key, vocab_path, vocab_size):
    vectors_variable = [v for v in tf.trainable_variables() if embedding_key in v.name]
    if len(vectors_variable) != 1:
        print("Word vector variable not found or too many. key: " + embedding_key)
        print("Existing embedding trainable variables:")
        print([v.name for v in tf.trainable_variables() if "embedding" in v.name])
        sys.exit(1)
    vectors_variable = vectors_variable[0]
    vectors = vectors_variable.eval()

    with gfile.GFile(vocab_path, mode="r") as vocab_file:
        counter = 0
        while counter < vocab_size:
            vocab_w = vocab_file.readline().replace("\n", "")
            # for each word in the vocabulary, check if a w2v vector exists and inject it;
            # otherwise don't change the value
            if vocab_w in word2vec_model:
                w2w_word_vector = word2vec_model.get_vector(vocab_w)
                vectors[counter] = w2w_word_vector
            counter += 1

    session.run([vectors_variable.initializer],
                {vectors_variable.initializer.inputs[1]: vectors})
I am not familiar with the seq2seq example, but in general you can use the following code snippet to inject your embeddings:
Where you build your graph:
with tf.device("/cpu:0"):
    embedding = tf.get_variable("embedding", [vocabulary_size, embedding_size])
    inputs = tf.nn.embedding_lookup(embedding, input_data)
When you execute (after building your graph and before starting the training), just assign your saved embeddings to the embedding variable:
session.run(tf.assign(embedding, embeddings_that_you_want_to_use))
The idea is that embedding_lookup will look up each index in input_data and replace it with the corresponding row of the embedding variable.
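A small variation that is often recommended (a sketch in the same TF1 style; pretrained_embeddings is a stand-in name for your numpy array of vectors): feeding the matrix through a placeholder keeps the large constant out of the serialized graph, instead of baking it in via tf.assign:
import tensorflow as tf

with tf.device("/cpu:0"):
    embedding = tf.get_variable("embedding", [vocabulary_size, embedding_size])

embedding_ph = tf.placeholder(tf.float32, [vocabulary_size, embedding_size])
embedding_init = embedding.assign(embedding_ph)

# After building the graph, before training starts:
session.run(embedding_init, feed_dict={embedding_ph: pretrained_embeddings})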