Using pretrained gensim Word2vec embedding in keras - python

I have trained a word2vec model in gensim. In Keras, I want to use it to build the matrix of each sentence from that word embedding. Storing the matrices of all the sentences up front is very space and memory inefficient, so I want to create an Embedding layer in Keras that does this, so it can be used in further layers (LSTM). Can you tell me in detail how to do this?
PS: It is different from other questions because I am using gensim for the word2vec training instead of Keras.

Let's say you have the following data that you need to encode:
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
You must then tokenize it using the Keras Tokenizer and find the vocab_size:
from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
You can then encode it to sequences like this:
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
You can then pad the sequences so that they all have a fixed length:
from keras.preprocessing.sequence import pad_sequences

max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
Then use the word2vec model to build the embedding matrix:
from numpy import asarray, zeros

# load embedding as a dict
def load_embedding(filename):
    # load embedding into memory, skip the first line (the word2vec header)
    file = open(filename, 'r')
    lines = file.readlines()[1:]
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is the string word, value is the numpy array for the vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 1 for index 0 (reserved for padding/unknown words)
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = zeros((vocab_size, 100))
    # step over the vocab, storing each vector at the Tokenizer's integer index
    for word, i in vocab.items():
        vector = embedding.get(word)
        if vector is not None:  # leave an all-zero row for words missing from the embedding
            weight_matrix[i] = vector
    return weight_matrix
# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, t.word_index)
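Note that load_embedding above expects the vectors in word2vec text format (with a header line, which is why the first line is skipped). As a hedged sketch, a gensim-trained model can be exported to that format like this (the variable name and model path are just examples):
from gensim.models import Word2Vec

# load your trained gensim model (hypothetical path)
w2v_model = Word2Vec.load('word2vec.model')

# export the vectors in word2vec text format; the header line written here
# is the one that load_embedding() skips
w2v_model.wv.save_word2vec_format('embedding_word2vec.txt', binary=False)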
Once you have the embedding matrix, you can use it in an Embedding layer like this:
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
This layer can then be used when building a model like this:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
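One assumption in the fit call above: labels is not defined in the snippet. It is the per-document target; with the ten example docs, a plausible binary sentiment labelling (first five positive, last five negative) would be:
from numpy import array

# hypothetical labels for the ten example documents
labels = array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])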
All the code is adapted from this excellent blog post; follow it to learn more about embeddings using GloVe.
For using word2vec, see this post.

With the new Gensim version this is pretty easy:
w2v_model.wv.get_keras_embedding(train_embeddings=False)
There you have your Keras embedding layer.
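For example, the returned layer can be dropped straight into a model. This is a minimal sketch; note that get_keras_embedding ships with gensim 3.x and was removed in gensim 4.x, where you would instead build the Embedding layer yourself from w2v_model.wv.vectors:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
# frozen embedding layer built directly from the gensim model
model.add(w2v_model.wv.get_keras_embedding(train_embeddings=False))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])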

My code for a gensim-trained w2v model. Assume all of the words trained in the w2v model are now in a list variable called all_words.
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding
import gensim
import pandas as pd
import numpy as np
from itertools import chain

w2v = gensim.models.Word2Vec.load("models/w2v.model")
vocab = w2v.wv.vocab
t = Tokenizer()
vocab_size = len(all_words) + 1
t.fit_on_texts(all_words)

def get_weight_matrix():
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, w2v.vector_size))
    # store each word's vector at index i + 1 (index 0 is reserved for padding)
    for i in range(len(all_words)):
        weight_matrix[i + 1] = w2v.wv[all_words[i]]  # w2v[word] is deprecated in newer gensim
    return weight_matrix

embedding_vectors = get_weight_matrix()
emb_layer = Embedding(vocab_size, output_dim=w2v.vector_size, weights=[embedding_vectors],
                      input_length=FIXED_LENGTH, trainable=False)

Related

How to get the word of an embedding vector from a pretrained Hugging Face model?

I use Hugging Face's pretrained model, BERT, to help me get the meaning of a sentence by pooling (that is, tokenizing the sentence and taking the average vector of all the embedded words). My code is as follows. I want to get the word that the pooled vector refers to.
import torch
from transformers import BertModel, BertTokenizer

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
# Load the model
model = BertModel.from_pretrained(model_name)
# input sentence
input_text = "Here is some text to encode"
# from tokenizer to token_id
input_ids = tokenizer.encode(input_text, add_special_tokens=True)
# input_ids: [101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]
input_ids = torch.tensor([input_ids])
# get the tensors
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # model outputs are now tuples
# sentence pooling
last_hidden_states = last_hidden_states.mean(1)
print(last_hidden_states)
# last_hidden_states.shape = [1, 768]
After this, I want to get the word for this encoded vector ([1, 768]).
Theoretically, I should use the embedding matrix (of size [dictionary_dimension, embedding_dimension]) and then use the result of comparing against that matrix as an index into the dictionary.
How could I get the embedding_matrix in the embedding layers of Hugging Face, please?
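As a rough sketch of what is being described here (hedged: get_input_embeddings() returns BERT's static input embedding table, so a nearest-neighbour lookup against it only approximates the contextualised space that the pooled vector lives in):
import torch
from transformers import BertModel, BertTokenizer

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# the input (wordpiece) embedding matrix, shape [vocab_size, 768]
embedding_matrix = model.get_input_embeddings().weight

# given a pooled sentence vector of shape [1, 768], find the closest token
pooled = torch.randn(1, 768)  # stands in for last_hidden_states.mean(1)
sims = torch.nn.functional.cosine_similarity(pooled, embedding_matrix, dim=-1)
closest_id = int(sims.argmax())
print(tokenizer.convert_ids_to_tokens([closest_id]))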

How to calculate the mean of words' GloVe embeddings in a sentence

I have downloaded the pretrained GloVe matrix and used it in a Keras Embedding layer; however, I need the sentence embedding for another task.
I want to calculate the mean of all the word embeddings in each sentence.
What is the most efficient way to do that, since there are about 25,000 sentences?
Also, I don't want to use a Lambda layer in Keras to compute the mean.
The best way to do this is to use a GlobalAveragePooling1D layer. It receives the token embeddings from the Embedding layer, with shape (n_sentences, n_tokens, emb_dim), and computes the average over the tokens in each sentence; the result has shape (n_sentences, emb_dim).
Here is a code example:
import numpy as np
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D
from tensorflow.keras.models import Model

embedding_dim = 128
vocab_size = 100
sentence_len = 20

embedding_matrix = np.random.uniform(-1, 1, (vocab_size, embedding_dim))
test_sentences = np.random.randint(0, vocab_size, (3, sentence_len))

inp = Input((sentence_len,))
embedder = Embedding(vocab_size, embedding_dim,
                     trainable=False, weights=[embedding_matrix])(inp)
avg = GlobalAveragePooling1D()(embedder)

model = Model(inp, avg)
model.summary()

model(test_sentences)  # the mean of all the word embeddings inside each sentence
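One caveat, as a hedged aside: if the sentences are padded, GlobalAveragePooling1D will average over the padding positions as well unless the Embedding layer emits a mask. In recent tf.keras versions the pooling layer respects such a mask, so a sketch (assuming index 0 is reserved for padding) looks like this:
import numpy as np
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D
from tensorflow.keras.models import Model

embedding_dim = 128
vocab_size = 100
sentence_len = 20
embedding_matrix = np.random.uniform(-1, 1, (vocab_size, embedding_dim))

inp = Input((sentence_len,))
# mask_zero=True treats index 0 as padding and propagates a mask downstream
emb = Embedding(vocab_size, embedding_dim, mask_zero=True,
                trainable=False, weights=[embedding_matrix])(inp)
# masked (padding) positions are excluded from the average
avg = GlobalAveragePooling1D()(emb)
model = Model(inp, avg)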

How to get Keras model predicted text back into list of words?

I'm trying to build an autoencoder neural network for finding outliers in a single-column list of text. The text input is like the following:
about_header.png
amaze_header_2.png
amaze_header.png
circle_shape.xml
disableable_ic_edit_24dp.xml
fab_label_background.xml
fab_shadow_black.9.png
fab_shadow_dark.9.png
fab_shadow_light.9.png
fastscroller_handle_normal.xml
fastscroller_handle_pressed.xml
folder_fab.png
The problem is that I don't really know what I'm doing. I'm using Keras, and I've converted these lines of text into a matrix using the Keras Tokenizer so they can be fed into a Keras model, which I can then fit and use for predictions.
The problem is that the predict function returns what I believe is a matrix, and I can't really know for sure what happened, because I can't convert the matrix back into the list of text I originally had.
My entire code is as follows:
import sys
from keras import Input, Model
import matplotlib.pyplot as plt
from keras.layers import Dense
from keras.preprocessing.text import Tokenizer

with open('drawables.txt', 'r') as arquivo:
    dados = arquivo.read().splitlines()

tokenizer = Tokenizer(filters='', nb_words=None)
tokenizer.fit_on_texts(dados)
x_dados = tokenizer.texts_to_matrix(dados, mode="count")

tamanho = len(tokenizer.word_index) + 1
tamanho_comprimido = int(tamanho / 1.25)

x = Input(shape=(tamanho,))

# Encoder
hidden_1 = Dense(tamanho_comprimido, activation='relu')(x)
h = Dense(tamanho_comprimido, activation='relu')(hidden_1)

# Decoder
hidden_2 = Dense(tamanho, activation='relu')(h)
r = Dense(tamanho, activation='sigmoid')(hidden_2)

autoencoder = Model(input=x, output=r)
autoencoder.compile(optimizer='adam', loss='mse')

history = autoencoder.fit(x_dados, x_dados, epochs=25, shuffle=False)

plt.plot(history.history["loss"])
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.show()

encoded = autoencoder.predict(x_dados)
result = ???????
You can decode the text using the original encoding via tokenizer.sequences_to_texts. This accepts a list of integer sequences; to get the sequences you can use np.argmax.
encoded_argmax = np.argmax(encoded, axis=1)
text = tokenizer.sequences_to_texts([encoded_argmax])  # the output is a single index per row, so wrap it in a list

LSTM network on pre trained word embedding gensim

I am new to deep learning. I am trying to build a very basic LSTM network on word embedding features. I have written the following code for the model but I am unable to run it.
from keras.layers import Dense, LSTM, merge, Input,Concatenate
from keras.layers.recurrent import LSTM
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Flatten
max_sequence_size = 14
classes_num = 2
LSTM_word_1 = LSTM(100, activation='relu',recurrent_dropout = 0.25, dropout = 0.25)
lstm_word_input_1 = Input(shape=(max_sequence_size, 300))
lstm_word_out_1 = LSTM_word_1(lstm_word_input_1)
merged_feature_vectors = Dense(50, activation='sigmoid')(Dropout(0.2)(lstm_word_out_1))
predictions = Dense(classes_num, activation='softmax')(merged_feature_vectors)
my_model = Model(input=[lstm_word_input_1], output=predictions)
print my_model.summary()
The error I am getting is ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (3019, 300). On searching, I found that people have used Flatten(), which compresses the 2-D features (3019, 300) for the dense layer, but I am unable to fix the issue.
While explaining, kindly let me know how the dimensions work out.
Upon request:
My X_training had dimension issues, so I am providing the code below to clear up the confusion:
def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    #
    nwords = 0.
    #
    # index2word is a list that contains the names of the words in
    # the model's vocabulary. Convert it to a set, for speed
    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model[word])
    #
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec, nwords)
    return featureVec
I think the following code is producing a 2-D numpy array, as I am initializing it that way:
def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    #
    # Initialize a counter
    counter = 0.
    #
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        if counter % 1000. == 0.:
            print "Question %d of %d" % (counter, len(reviews))
        reviewFeatureVecs[int(counter)] = makeFeatureVec(review, model, num_features)
        counter = counter + 1.
    return reviewFeatureVecs
def getCleanReviews(reviews):
    clean_reviews = []
    for review in reviews["question"]:
        clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, remove_stopwords=True))
    return clean_reviews
My objective is just to use a gensim pretrained model with an LSTM on some comments that I have.
trainDataVecs = getAvgFeatureVecs( getCleanReviews(train), model, num_features )
You should try using an Embedding layer before the LSTM layer. Also, since you have pre-trained 300-dimensional vectors for 3019 comments, you can initialize the weights of the Embedding layer with this matrix.
inp_layer = Input((maxlen,))
x = Embedding(max_features, embed_size, weights=[trainDataVecs])(inp_layer)
x = LSTM(50, dropout=0.1)(x)
Here, maxlen is the maximum length of your comments (in tokens), max_features is the number of unique words (the vocabulary size) of your dataset, and embed_size is the dimensionality of your vectors, which is 300 in your case.
Note that the shape of trainDataVecs should be (max_features, embed_size), so if you have pre-trained word vectors loaded into trainDataVecs, this should work.
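Putting the answer together as a minimal sketch (the values of max_features and embedding_matrix are assumptions: embedding_matrix stands for a (vocab_size, 300) matrix of word2vec vectors indexed by your tokenizer, as in the first answer at the top of this page, rather than the per-comment averages in trainDataVecs):
import numpy as np
from keras.layers import Input, Embedding, LSTM, Dropout, Dense
from keras.models import Model

maxlen = 14            # max_sequence_size from the question
max_features = 10000   # vocabulary size (assumed)
embed_size = 300       # word2vec dimensionality

# placeholder; in practice build this (vocab_size x 300) matrix from the
# gensim model, as shown in the first answer at the top of this page
embedding_matrix = np.zeros((max_features, embed_size))

inp_layer = Input((maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix],
              trainable=False)(inp_layer)
x = LSTM(50, dropout=0.1)(x)
x = Dropout(0.2)(x)
out = Dense(2, activation='softmax')(x)

model = Model(inputs=inp_layer, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()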

Making predictions on text data using keras

I had trained and tested a CNN for sentiment analysis. The train and test data were prepared the same way, by tokenizing the sentences and assigning unique integers:
tokenizer = Tokenizer(filters='$%&()*/:;<=>#[\\]^`{|}~\t\n')
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1
sequences = tokenizer.texts_to_sequences(text)
Then I used a pre-trained GloVe model to create the embedding matrix for the CNN:
filepath_glove = 'glove.twitter.27B.200d.txt'
glove_vocab = []
glove_embd = []
embedding_dict = {}

file = open(filepath_glove, 'r', encoding='UTF-8')
for line in file.readlines():
    row = line.strip().split(' ')
    vocab_word = row[0]
    glove_vocab.append(vocab_word)
    embed_vector = [float(i) for i in row[1:]]  # convert to list of float
    embedding_dict[vocab_word] = embed_vector
file.close()

embedding_matrix = np.zeros((vocab_size, 200))  # preallocation, assumed elsewhere in the original script
for word, index in tokenizer.word_index.items():
    if word in embedding_dict:
        embedding_matrix[index] = embedding_dict[word]
At this point I also used the test sentences to create this matrix, which was later passed as weights into the embedding layer:
e = Embedding(vocab_size, 200, input_length=maxSeqLength, weights=[embedding_matrix], trainable=False)(inp)
Now I want to reload my model and test it with some new data, but that would mean the embedding matrix won't include some of the words from the new data. This makes me wonder whether I shouldn't have included the test data when creating the embedding matrix in the first place. And if not, how does the embedding layer handle those new words? This part is similar to this question, but I couldn't find an answer there:
How does the Keras Embedding Layer work if word is not found?
Thanks
It's quite simple. You are providing the vocab_size, which is the number of words the embedding layer knows about. If you pass an index that is out of bounds of the vocab_size (a new word), it will be ignored, or an error will be thrown by Keras.
This answers your question about whether you should include all of your data when building the embedding matrix: yes, you should.
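If you would rather map new words to a dedicated index than drop them, one option is the Tokenizer's oov_token argument, with its own (here all-zero) row in the embedding matrix. A hedged sketch, where train_texts stands for your training sentences and embedding_dict is the GloVe dictionary built in the question:
import numpy as np
from keras.preprocessing.text import Tokenizer

# unseen words at prediction time are mapped to the '<OOV>' index
# instead of being silently dropped
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(train_texts)   # fit on training data only
vocab_size = len(tokenizer.word_index) + 1

embedding_matrix = np.zeros((vocab_size, 200))
for word, index in tokenizer.word_index.items():
    vector = embedding_dict.get(word)
    if vector is not None:
        embedding_matrix[index] = vector
    # words not in GloVe (including '<OOV>') keep an all-zero row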
