LSTM network on pre-trained word embeddings (gensim) - python

I am new to deep learning. I am trying to build a very basic LSTM network on top of word embedding features. I have written the following code for the model, but I am unable to run it.
from keras.layers import Dense, Dropout, Input, LSTM
from keras.models import Model

max_sequence_size = 14
classes_num = 2

LSTM_word_1 = LSTM(100, activation='relu', recurrent_dropout=0.25, dropout=0.25)
lstm_word_input_1 = Input(shape=(max_sequence_size, 300))
lstm_word_out_1 = LSTM_word_1(lstm_word_input_1)

merged_feature_vectors = Dense(50, activation='sigmoid')(Dropout(0.2)(lstm_word_out_1))
predictions = Dense(classes_num, activation='softmax')(merged_feature_vectors)

my_model = Model(input=[lstm_word_input_1], output=predictions)
print my_model.summary()
The error I am getting is ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (3019, 300). On searching, I found that people have used Flatten() to compress the 2-D features (3019, 300) before the dense layer, but I am unable to fix the issue.
While explaining, kindly let me know how the dimensions work out.
Upon request:
My X_training had dimension issues, so I am providing the code below to clear up the confusion:
def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0.
    # index2word is a list that contains the names of the words in
    # the model's vocabulary. Convert it to a set, for speed.
    index2word_set = set(model.wv.index2word)
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model[word])
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec, nwords)
    return featureVec
I think the following code is giving a 2-D numpy array, since I am initializing it that way:
def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    #
    # Initialize a counter
    counter = 0.
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        if counter % 1000. == 0.:
            print "Question %d of %d" % (counter, len(reviews))
        reviewFeatureVecs[int(counter)] = makeFeatureVec(review, model, num_features)
        counter = counter + 1.
    return reviewFeatureVecs
def getCleanReviews(reviews):
    clean_reviews = []
    for review in reviews["question"]:
        clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, remove_stopwords=True))
    return clean_reviews
My objective is just to use a gensim pre-trained model with an LSTM on some comments that I have.
trainDataVecs = getAvgFeatureVecs( getCleanReviews(train), model, num_features )

You should try using an Embedding layer before the LSTM layer. Also, since you have pre-trained 300-dimensional vectors for 3019 comments, you can initialize the Embedding layer's weights with this matrix.
inp_layer = Input((maxlen,))
x = Embedding(max_features, embed_size, weights=[trainDataVecs])(inp_layer)
x = LSTM(50, dropout=0.1)(x)
Here, maxlen is the maximum length of your comments, max_features is the maximum number of unique words or the vocabulary size of your dataset, and embed_size is the dimensionality of your vectors, which is 300 in your case.
Note that the shape of trainDataVecs should be (max_features, embed_size), so if you have pre-trained word vectors loaded into trainDataVecs, this should work.
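For reference, a minimal end-to-end sketch along these lines might look as follows; the vocabulary size, the random placeholder matrix, the dense head, and the compile settings are assumptions for illustration, not part of the original answer:
import numpy as np
from keras.layers import Input, Embedding, LSTM, Dropout, Dense
from keras.models import Model

maxlen, max_features, embed_size = 14, 20000, 300
# placeholder for a real pre-trained matrix of shape (max_features, embed_size)
embedding_matrix = np.random.rand(max_features, embed_size).astype('float32')

inp_layer = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp_layer)
x = LSTM(50, dropout=0.1)(x)
x = Dropout(0.2)(x)
predictions = Dense(2, activation='softmax')(x)

my_model = Model(inputs=[inp_layer], outputs=predictions)
my_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
my_model.summary()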

Related

How to get the latent vector as an output from a cnn model before training to the fully connected layer?

I am working on a CNN model using the TensorFlow framework in Google Colab. I am unable to extract the latent vectors from the convolutional layers. I want to extract the output of the convolutional layers, i.e. the layers before the fully connected layer.
I have tried the following code:
a = dropout()(classifier_model.output)
print(a)
I am unable to understand the solution suggested in this link: Stack Overflow solution to print the value of a TensorFlow object after applying a conv-pool layer.
Does anyone have any suggestions?
You can use the get_layer method of the Model class to get a layer by its name. Below is an example with a dummy 1D CNN and a binary classifier:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv1D, ReLU, MaxPool1D, Flatten, Dense
from tensorflow.keras.models import Model

timesteps = 100
nfeatures = 2

# build the model using the functional API
# example of a 1D CNN inspired by your Stack Overflow link, but using a model
# instead of successive *raw* layers; the Conv1D filter and kernel values are different
input = Input((timesteps, nfeatures))
p = Conv1D(filters=16, kernel_size=10)(input)
p = ReLU()(p)
p = MaxPool1D(pool_size=2)(p)
p = Conv1D(filters=32, kernel_size=10)(p)
p = ReLU()(p)
p = MaxPool1D(pool_size=2)(p)
p = Conv1D(filters=64, kernel_size=10)(p)
p = ReLU()(p)
p = MaxPool1D(pool_size=2, name='conv1Dfeat')(p) # give a name to the CNN output
# fully connected part
p = Flatten()(p)
p = Dense(10)(p)
# could add a dropout layer to ease optimization
finaloutput = Dense(1, activation='sigmoid')(p)
# full model
model = Model(inputs=input, outputs=finaloutput)
# compile network, i.e. define optimizer, loss and metrics
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
You need to train the model using the fit method with some data. Then you can get the output of the layer named conv1Dfeat (the last layer of the convolutional part) by defining a second model:
modelCNN = Model(inputs=input, outputs=model.get_layer('conv1Dfeat').output)
modelCNN.summary()
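For completeness, the training step mentioned above could be as simple as the following sketch on dummy data (the batch size, labels, and epoch count are placeholders); fitting happens on the full model before extracting features:
# minimal sketch of the fit step; shapes follow timesteps/nfeatures defined above
X = np.random.normal(size=(128, timesteps, nfeatures))
y = np.random.randint(0, 2, size=(128, 1))   # binary labels for the sigmoid output
model.fit(X, y, batch_size=16, epochs=2)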
If you want to get the output of the convolutional part for, say, a single numpy input array of shape (timesteps, nfeatures), you can use the predict method of the Model class on batched data:
data = np.random.normal(size=(timesteps, nfeatures)) # dummy data
data_tf = tf.expand_dims(data, axis=0) # convert to TF tensor and add batch dimension at the same time
cnn_out_np = modelCNN.predict(data_tf)
cnn_out_np = np.squeeze(cnn_out_np, axis=0) # remove batch dimension
print(cnn_out_np.shape)
(4, 64)

Siamese LSTM for document similarity using keras giving input error

I have written this code to use a Siamese approach to calculate the similarity of two documents. I want to vectorize two separate documents (embedding is performed using the Google News dataset), feed the vectors to an LSTM, and pass the LSTM output to a cosine function to measure the similarity of the two documents.
# importing libraries
from __future__ import print_function
import gensim
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import csv
import re
import pandas as pd
from pandas import DataFrame
nltk.download('punkt')
from tensorflow import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional, TimeDistributed
from tensorflow.keras import layers

# Loading pre-trained word2vec model
from gensim.models.keyedvectors import KeyedVectors
# You need to download the Google pre-trained model using the link below
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# Change the path according to your directory
model_path = 'D:\GoogleNews_vectors_negative300\GoogleNews_vectors_negative300.bin'
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

# Setting parameters for the model
class DocSim(object):
    def __init__(self, w2v_model, stopwords=[]):
        self.w2v_model = w2v_model
        self.stopwords = stopwords

    def vectorize(self, doc):
        """Identify the vector values for each word in the given document"""
        doc = doc.lower()
        words = [w for w in doc.split(" ") if w not in self.stopwords]
        word_vecs = []
        for word in words:
            try:
                vec = self.w2v_model[word]
                word_vecs.append(vec)
            except KeyError:
                # Ignore, if the word doesn't exist in the vocabulary
                pass
        # Assuming that document vector is the mean of all the word vectors
        vector = np.mean(word_vecs, axis=0)
        return vector

    def Siamese_cosine_sim(self, vectorA, vectorB):
        model = Sequential()
        model.add(LSTM(20, return_sequences=True), input_shape=[vectorA, vectorB])
        model.compile(loss='binary_crossentropy', optimizer='adam')
        outputs = layers.Dense(1, activation="sigmoid")(left_doc, right_doc)
        """Find the cosine similarity distance between two vectors."""
        csim = np.dot(left_doc, right_doc) / (np.linalg.norm(left_doc) * np.linalg.norm(right_doc))
        if np.isnan(np.sum(csim)):
            return 0
        return csim

    def calculate_similarity(self, withdigits_source_rules, withdigits_target_rules=[], threshold=0.8):
        """Calculates & returns similarity scores between given source rules & all
        the target rules"""
        if isinstance(withdigits_target_rules, str):
            withdigits_target_rules = [withdigits_target_rules]
        source_vec = self.vectorize(withdigits_source_rules)
        results = []
        for rule in withdigits_target_rules:
            target_vec = self.vectorize(rule)
            sim_score = self.Siamese_cosine_sim(source_vec, target_vec)
            if sim_score > threshold:
                results.append({
                    'Siamese Sim Score': sim_score,
                    'Target Rule': rule
                })
        # Sort results by score in desc order
        results.sort(key=lambda k: k['Siamese Sim Score'], reverse=True)
        return results

ds = DocSim(w2v_model)

# Two documents data
withdigits_source_rules = set(["2.1 Separation of trains", "2.3.1.2 Level crossing obstruction", "2.2.1.1 Safety is compromised if a train proceeds without a movement autority", "Principle: The method of signalling must maintain a space interval between trains that is safe.", "2.1.1 Context"])

# Calculate the similarity score between a source rule & a target rule.
if isinstance(withdigits_source_rules, str):
    withdigits_source_rules = [withdigits_source_rules]

# This will return one target rule's text with a similarity score
for rule in withdigits_source_rules:
    sim_scores = ds.calculate_similarity(rule, withdigits_target_rules)

    # Printing the output in a text file
    print("Source rule: {} \n\nSimilarity with Target Rule is \n\n {}\n".format(rule, sim_scores), file=open("output.txt", "a"))
    print("\n")

    # Printing output in Jupyter
    print("Source rule: {} \n\nSimilarity with Target Rule is \n\n {}\n".format(rule, sim_scores))
    print("\n")
I am getting the following error. Can someone help me solve this issue, along with the LSTM input function?
TypeError: add() got an unexpected keyword argument 'input_shape'
Refer to the documentation here for adding layers to a Sequential model. The add method only accepts one parameter, layer. If the passed argument is not a layer instance, it raises a TypeError, which is precisely the error you got. I guess you wanted to pass the input_shape parameter to the LSTM layer (the line after creating the Sequential model). Just move it inside the LSTM constructor, as sketched below, and it should work fine.
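For illustration, a minimal sketch of the corrected call might be the following; the timestep and feature counts are placeholders, since an LSTM expects 3-D input of shape (batch, timesteps, features):
from keras.models import Sequential
from keras.layers import LSTM

timesteps, n_features = 10, 300   # placeholder values, not from the question
model = Sequential()
model.add(LSTM(20, return_sequences=True, input_shape=(timesteps, n_features)))
model.compile(loss='binary_crossentropy', optimizer='adam')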

Keras: Share a layer of weights across Training Examples (Not between layers)

The problem is the following. I have a categorical prediction task with a vocabulary size of 25K. On one of them (input vocab 10K, output dim, i.e. embedding, 50), I want to introduce a trainable weight matrix for a matrix multiplication between the input embedding (shape (1, 50)) and the weights (shape (50, 128)), with no bias. The resulting vector score is an input for a prediction task along with other features.
The crux is, I think, that the trainable weight matrix varies for each input if I simply add it in. I want this weight matrix to be common across all inputs.
I should clarify: by input here I mean training examples. So all examples would learn some example-specific embedding and be multiplied by a shared weight matrix.
After every so many epochs, I intend to do a batch update to learn these common weights (or use other target variables to do multiple-output prediction).
LSTM? Is that something I should look into here?
With the exception of an Embedding layer, layers apply to all examples in the batch.
Take as an example a very simple network:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inp = Input(shape=(4,))
h1 = Dense(2, activation='relu', use_bias=False)(inp)
out = Dense(1)(h1)
model = Model(inp, out)
This is a simple network with 1 input layer, 1 hidden layer and an output layer. If we take the hidden layer as an example, this layer has a weight matrix of shape (4, 2). At each iteration the input data, which is a matrix of shape (batch_size, 4), is multiplied by the hidden layer weights (the feed-forward phase), so the h1 activations depend on all samples. The loss is also computed on a per-batch basis. The output layer has shape (batch_size, 1). Given that in the forward phase all the batch samples affected the values of the weights, the same is true for backprop and the gradient updates.
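As a quick sketch of these shapes (reusing the small model defined above; the dummy batch of 8 samples is an assumption):
import numpy as np

w = model.get_layer(index=1).get_weights()[0]
print(w.shape)   # (4, 2): one hidden-layer weight matrix, shared by every sample

preds = model.predict(np.random.normal(size=(8, 4)))
print(preds.shape)   # (8, 1): one output per sample, all produced by the same weights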
When one is dealing with text, the problem is often specified as predicting a specific label from a sequence of words. This is modelled with a shape of (batch_size, sequence_length, word_index). Let's take a very basic example:
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

sequence_length = 80
emb_vec_size = 100
vocab_size = 10_000

def make_model():
    inp = Input(shape=(sequence_length, 1))
    emb = Embedding(vocab_size, emb_vec_size)(inp)
    emb = Reshape((sequence_length, emb_vec_size))(emb)
    h1 = Dense(64)(emb)
    recurrent = LSTM(32)(h1)
    output = Dense(1)(recurrent)
    model = Model(inp, output)
    model.compile('adam', 'mse')
    return model

model = make_model()
model.summary()
You can copy and paste this into Colab and see the summary.
What this example does is:
1. Transform a sequence of word indices into a sequence of word-embedding vectors.
2. Apply a Dense layer, h1, to all the batches (and all the elements in the sequence); this layer reduces the dimensionality of the embedding vectors. It is not a typical element of a text-processing network on its own, but it seemed to match your question.
3. Use a recurrent layer to reduce the sequence into a single vector per example.
4. Predict a single label from the "sentence" vector.
If I understand the problem correctly, you can reuse layers or even whole models inside another model.
Here is an example with a Dense layer. Let's say you have 10 inputs:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# defining 10 inputs in a list with (X,) shape
inputs = [Input(shape=(X,), name='input_{}'.format(k)) for k in range(10)]

# defining a common Dense layer
D = Dense(64, name='one_layer_to_rule_them_all')
nets = [D(inp) for inp in inputs]

model = Model(inputs=inputs, outputs=nets)
model.compile(optimizer='adam', loss='categorical_crossentropy')
This code will not work if the inputs have different shapes, since the first call to D fixes its properties. In this example the outputs are set directly to nets, but of course you can concatenate, stack, or combine them however you want, as sketched below.
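As an illustration of the concatenate option, the ten shared-layer outputs could be merged into a single prediction head; this is only a sketch, and the input dimension X = 16 and the three-class head are placeholder assumptions:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

X = 16  # placeholder input dimension
inputs = [Input(shape=(X,), name='input_{}'.format(k)) for k in range(10)]
D = Dense(64, name='one_layer_to_rule_them_all')   # one layer, reused for every input
nets = [D(inp) for inp in inputs]

merged = Concatenate()(nets)                        # shape (batch, 10 * 64)
out = Dense(3, activation='softmax')(merged)
model = Model(inputs=inputs, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()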
Now, if you have some trainable model, you can use it instead of D:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# defining 10 inputs in a list with (X,) shape
inputs = [Input(shape=(X,), name='input_{}'.format(k)) for k in range(10)]

# defining a shared model with the same weights for all inputs
nets = [special_model(inp) for inp in inputs]

model = Model(inputs=inputs, outputs=nets)
model.compile(optimizer='adam', loss='categorical_crossentropy')
The weights of this model are shared among all inputs.
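A small sketch to convince yourself that the weights really are shared; special_model here is just a dummy stand-in built for the check, not something from the question:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model, Sequential

X = 16  # placeholder input dimension
special_model = Sequential([Dense(64, input_shape=(X,))])   # dummy shared sub-model

inputs = [Input(shape=(X,), name='input_{}'.format(k)) for k in range(10)]
nets = [special_model(inp) for inp in inputs]
model = Model(inputs=inputs, outputs=nets)

# only one kernel and one bias, even though the sub-model is applied 10 times
print(len(model.trainable_weights))   # 2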

Using pretrained gensim Word2vec embedding in keras

I have trained word2vec in gensim. In Keras, I want to use it to build sentence matrices using that word embedding. Storing the matrices of all the sentences is very space- and memory-inefficient, so I want to make an Embedding layer in Keras to achieve this, so that it can be used in further layers (LSTM). Can you tell me in detail how to do this?
PS: This is different from other questions because I am using gensim for the word2vec training instead of Keras.
Let's say you have the following data that you need to encode:
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
You must then tokenize it using the Tokenizer from Keras, like this, and find the vocab_size:
from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
You can then encode it to sequences like this:
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
You can then pad the sequences so that they are all of a fixed length:
from keras.preprocessing.sequence import pad_sequences

max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
Then use the word2vec model to make the embedding matrix:
from numpy import asarray, zeros

# load embedding as a dict
def load_embedding(filename):
    # load embedding into memory, skip first line
    file = open(filename, 'r')
    lines = file.readlines()[1:]
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is string word, value is numpy array for vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = zeros((vocab_size, 100))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        weight_matrix[i] = embedding.get(word)
    return weight_matrix

# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, t.word_index)
Once you have the embedding matrix, you can use it in an Embedding layer like this:
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
This layer can then be used in a model like this:
model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
All the code is adapted from this awesome blog post. Follow it to learn more about embeddings using GloVe.
For using word2vec, see this post.
With the new Gensim version this is pretty easy:
w2v_model.wv.get_keras_embedding(train_embeddings=False)
and there you have your Keras embedding layer.
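A sketch of how that layer could be dropped into a model (this assumes w2v_model is the loaded gensim model from the question; as far as I know get_keras_embedding exists in gensim 3.x and was removed in gensim 4, where you would build the Embedding layer from w2v_model.wv.vectors yourself):
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
# inputs to this layer must be word indices that follow gensim's own vocabulary ordering
model.add(w2v_model.wv.get_keras_embedding(train_embeddings=False))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])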
Here is my code for a gensim-trained w2v model. Assume all the words trained in the w2v model are now in a list variable called all_words.
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding
import gensim
import pandas as pd
import numpy as np
from itertools import chain

w2v = gensim.models.Word2Vec.load("models/w2v.model")
vocab = w2v.wv.vocab

t = Tokenizer()
vocab_size = len(all_words) + 1
t.fit_on_texts(all_words)

def get_weight_matrix():
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, w2v.vector_size))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for i in range(len(all_words)):
        weight_matrix[i + 1] = w2v[all_words[i]]
    return weight_matrix

embedding_vectors = get_weight_matrix()

emb_layer = Embedding(vocab_size, output_dim=w2v.vector_size, weights=[embedding_vectors], input_length=FIXED_LENGTH, trainable=False)

Concatenate Embedding layers

I'm trying to create a model which has words as inputs. Most of those words are in the GloVe word-vector set (~50,000). However, some of the frequent words are not (~1,000). The question is, how do I concatenate the following two embedding layers to create one giant embedding lookup table?
trained_em = Embedding(50000, 50,
weights=np.array([word2glove[w] for w in words_in_glove]),
trainable=False)
untrained_em = Embedding(1000, 50)
As far as I understand, these are simply two lookup tables with the same number of dimensions. So I'm hoping there is a way to stack these two lookup tables.
Edit 1:
I just realised that this is probably going to be more than stacking Embedding layers, because the input sequence would be a number from 0-50999. However, untrained_em above only expects numbers from 0-999. So perhaps a different solution is required.
Edit 2:
This is what I would expect to do in a numpy array representing the Embedding:
import numpy as np

np.random.seed(42) # Set seed for reproducibility
pretrained = np.random.randn(15,3)
untrained = np.random.randn(5,3)
final_embedding = np.vstack([pretrained, untrained])
word_idx = [2, 5, 19]
np.take(final_embedding, word_idx, axis=0)
I believe the last bit can be done with keras.backend.gather, but I'm not sure how to put it all together.
It turns out that I needed to implement a custom layer, which I did by tweaking the original Embedding class.
The two most important parts shown in the class below are self.embeddings = K.concatenate([fixed_weight, variable_weight], axis=0) and out = K.gather(self.embeddings, inputs). The first is hopefully self-explanatory, while the second picks out the relevant input rows from the embeddings table.
However, in the particular application I'm working on, it turns out to work better with a plain Embedding layer instead of the modified layer, perhaps because the learning rate is too high. I will report back on this after I have experimented more.
from keras.engine.topology import Layer
import keras.backend as K
from keras import initializers
import numpy as np

class Embedding2(Layer):

    def __init__(self, input_dim, output_dim, fixed_weights, embeddings_initializer='uniform',
                 input_length=None, **kwargs):
        kwargs['dtype'] = 'int32'
        if 'input_shape' not in kwargs:
            if input_length:
                kwargs['input_shape'] = (input_length,)
            else:
                kwargs['input_shape'] = (None,)
        super(Embedding2, self).__init__(**kwargs)

        self.input_dim = input_dim
        self.output_dim = output_dim
        self.embeddings_initializer = embeddings_initializer
        self.fixed_weights = fixed_weights
        self.num_trainable = input_dim - len(fixed_weights)
        self.input_length = input_length

    def build(self, input_shape, name='embeddings'):
        initializer = initializers.get(self.embeddings_initializer)
        shape1 = (self.num_trainable, self.output_dim)
        variable_weight = K.variable(initializer(shape1), dtype=K.floatx(), name=name + '_var')
        fixed_weight = K.variable(self.fixed_weights, name=name + '_fixed')

        self._trainable_weights.append(variable_weight)
        self._non_trainable_weights.append(fixed_weight)

        self.embeddings = K.concatenate([fixed_weight, variable_weight], axis=0)
        self.built = True

    def call(self, inputs):
        if K.dtype(inputs) != 'int32':
            inputs = K.cast(inputs, 'int32')
        out = K.gather(self.embeddings, inputs)
        return out

    def compute_output_shape(self, input_shape):
        if not self.input_length:
            input_length = input_shape[1]
        else:
            input_length = self.input_length
        return (input_shape[0], input_length, self.output_dim)
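If the class behaves as intended, a hypothetical usage sketch could look like the following; the random stand-in weights, shapes, and downstream layers are all assumptions for illustration:
import numpy as np
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# stand-in for the 50000 pre-trained GloVe rows (50-dimensional)
pretrained = np.random.randn(50000, 50).astype('float32')

inp = Input(shape=(None,), dtype='int32')
emb = Embedding2(51000, 50, fixed_weights=pretrained)(inp)  # rows 0-49999 frozen, 50000-50999 trainable
x = LSTM(64)(emb)
out = Dense(1, activation='sigmoid')(x)

model = Model(inp, out)
model.compile(optimizer='adam', loss='binary_crossentropy')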
So, my suggestion is to use only one Embedding layer (taking into consideration your indexing problem), and transfer the weights from the old layer to the new one.
So, what you're going to do in this suggestion is...
Create your new model with 51000 words:
inp = Input((1,))
emb = Embedding(51000,50)(inp)
out = the rest of the model.....
model = Model(inp,out)
Now take the embedding layer and give it the weights you had:
weights = np.array([word2glove[w] for w in words_in_glove])
newWeights = model.layers[1].get_weights()[0]
newWeights[:50000,:] = weights
model.layers[1].set_weights([newWeights])
This will give you a new embedding, larger than the previous one, with a great part of its weights already trained, and the remaining randomly initialized.
Unfortunately, you will have to let everything be trained.
