How to build an attention model with Keras? - python

I am trying to understand attention models and also build one myself. After many searches I came across this website, which has an attention model coded in Keras that also looks simple. But when I tried to build the same model on my machine, it gave a multiple-argument error. The error is due to mismatched argument passing in the class Attention: the website's Attention class asks for one argument, but the code initializes the Attention object with two arguments.
import tensorflow as tf
max_len = 200
rnn_cell_size = 128
vocab_size=250
class Attention(tf.keras.Model):
    def __init__(self, units):
        super(Attention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)
    def call(self, features, hidden):
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
sequence_input = tf.keras.layers.Input(shape=(max_len,), dtype='int32')
embedded_sequences = tf.keras.layers.Embedding(vocab_size, 128, input_length=max_len)(sequence_input)
lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM
(rnn_cell_size,
dropout=0.3,
return_sequences=True,
return_state=True,
recurrent_activation='relu',
recurrent_initializer='glorot_uniform'), name="bi_lstm_0")(embedded_sequences)
lstm, forward_h, forward_c, backward_h, backward_c = tf.keras.layers.Bidirectional \
(tf.keras.layers.LSTM
(rnn_cell_size,
dropout=0.2,
return_sequences=True,
return_state=True,
recurrent_activation='relu',
recurrent_initializer='glorot_uniform'))(lstm)
state_h = tf.keras.layers.Concatenate()([forward_h, backward_h])
state_c = tf.keras.layers.Concatenate()([forward_c, backward_c])
# PROBLEM IN THIS LINE
context_vector, attention_weights = Attention(lstm, state_h)
output = tf.keras.layers.Dense(1, activation='sigmoid')(context_vector)
model = tf.keras.Model(inputs=sequence_input, outputs=output)
# summarize layers
print(model.summary())
How can I make this model work?

The problem is in the way you initialize the attention layer and pass its parameters. The constructor takes the number of attention units, and the tensors are passed when the resulting layer is called, so the line should be:
context_vector, attention_weights = Attention(32)(lstm, state_h)
The result:
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 200) 0
__________________________________________________________________________________________________
embedding (Embedding) (None, 200, 128) 32000 input_1[0][0]
__________________________________________________________________________________________________
bi_lstm_0 (Bidirectional) [(None, 200, 256), ( 263168 embedding[0][0]
__________________________________________________________________________________________________
bidirectional (Bidirectional) [(None, 200, 256), ( 394240 bi_lstm_0[0][0]
bi_lstm_0[0][1]
bi_lstm_0[0][2]
bi_lstm_0[0][3]
bi_lstm_0[0][4]
__________________________________________________________________________________________________
concatenate (Concatenate) (None, 256) 0 bidirectional[0][1]
bidirectional[0][3]
__________________________________________________________________________________________________
attention (Attention) [(None, 256), (None, 16481 bidirectional[0][0]
concatenate[0][0]
__________________________________________________________________________________________________
dense_3 (Dense) (None, 1) 257 attention[0][0]
==================================================================================================
Total params: 706,146
Trainable params: 706,146
Non-trainable params: 0
__________________________________________________________________________________________________
None
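To make the difference between constructing a layer and calling it explicit, here is a minimal sketch in plain Python. ToyAttention is a hypothetical stand-in that only mimics the calling convention of a Keras layer; it is not the Keras API and does no real computation, and the strings stand in for tensors.

```python
class ToyAttention:
    """Toy stand-in for a Keras layer (hypothetical, for illustration only)."""

    def __init__(self, units):
        # Construction time: only hyperparameters are accepted here.
        self.units = units

    def __call__(self, features, hidden):
        # Call time: this is where the input tensors are passed.
        return f"context from {features} and {hidden} using {self.units} units"


# Mistake from the question: the tensors are passed to the constructor.
try:
    ToyAttention("lstm_output", "state_h")
    constructor_error = None
except TypeError as exc:
    constructor_error = exc

# Fix from the answer: build the layer first, then call it on the tensors.
result = ToyAttention(32)("lstm_output", "state_h")

print(type(constructor_error).__name__)
print(result)
```

The first call fails for the same reason as in the question: `__init__` accepts one argument besides `self`, but two were given.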

Attention layers are now part of the Keras API in TensorFlow (2.1). Note that such a layer outputs a tensor of the same size as your "query" tensor.
This is how to use Luong-style attention:
query_attention = tf.keras.layers.Attention()([query, value])
And Bahdanau-style attention :
query_attention = tf.keras.layers.AdditiveAttention()([query, value])
The adapted version:
attention_weights = tf.keras.layers.Attention()([lstm, state_h])
Check out the original website for more information: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention
https://www.tensorflow.org/api_docs/python/tf/keras/layers/AdditiveAttention

To answer Arman's specific query - these libraries use post-2018 semantics of queries, values and keys. To map the semantics back to Bahdanau or Luong's paper, you can consider the 'query' to be the last decoder hidden state. The 'values' will be the set of the encoder outputs - all the hidden states of the encoder. The 'query' 'attends' to all the 'values'.
Whichever version of code or library you are using, always note that the 'query' will be expanded over the time axis to prepare it for the subsequent addition that follows. This value (that is being expanded) will always be the last hidden state of the RNN. The other value will always be the values that need to be attended to - all the hidden states at the encoder end. This simple check of the code can be done to determine what 'query' and 'values' map to irrespective of the library or code that you are using.
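The "expand the query over the time axis, then add" check described above can be sketched in plain Python. This is a simplified illustration, not any library's implementation: the learned projections (W1, W2 and V in the code from the question) are left out, so the score is just the sum of tanh(value + query), and the function name and the sample numbers are made up for the example.

```python
import math


def additive_attention(query, values):
    """Simplified Bahdanau-style attention without learned projections.

    query:  one vector (e.g. the last hidden state), length d
    values: a list of T vectors (e.g. all encoder hidden states), each length d
    """
    # "Expanding the query over the time axis" means reusing the same query
    # vector at every time step so it can be added to each value vector.
    scores = [
        sum(math.tanh(v_i + q_i) for v_i, q_i in zip(value, query))
        for value in values
    ]

    # Softmax over the time axis turns the scores into attention weights.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The context vector is the attention-weighted sum of the values.
    dim = len(query)
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(dim)
    ]
    return context, weights


values = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.5]]  # T = 3 time steps, d = 2
query = [1.0, 1.0]
context, weights = additive_attention(query, values)
print(weights)
print(context)
```

There is one weight per time step of the values, and the context vector has the same size as a single value vector, which is the role the two arguments play in the code from the question.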
You can refer to https://towardsdatascience.com/create-your-own-custom-attention-layer-understand-all-flavours-2201b5e8be9e to write your own custom attention layer in fewer than 6 lines of code.

Related

Bidirectional LSTM in encoder decoder model running out of memory on training

latent_dim = 500
embedding_dim = 256
# Encoder
encoder_inputs = Input(shape=(max_eng_len,))
enc_emb = Embedding(x_voc_size, embedding_dim,trainable=True)(encoder_inputs)
#LSTM 1
encoder_lstm1 = Bidirectional(LSTM(latent_dim,return_sequences=True,return_state=True))
encoder_output1, forw_state_h, forw_state_c, back_state_h, back_state_c = encoder_lstm1(enc_emb)
final_enc_h = Concatenate()([forw_state_h,back_state_h])
final_enc_c = Concatenate()([forw_state_c,back_state_c])
encoder_states =[final_enc_h, final_enc_c]
# Decoder
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc_size, embedding_dim,trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)
#LSTM using encoder_states as initial state
decoder_lstm = LSTM(latent_dim*2, return_sequences=True, return_state=True)
decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=encoder_states)
#from tensorflow.keras.layers import Attention
#Attention Layer
attention_layer = AttentionLayer()
attn_res, attn_weight = attention_layer([encoder_output1, decoder_outputs])
# Concat attention output and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_res])
#Dense layer
decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)
# model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()
# Compile
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
checkpoint = ModelCheckpoint("/content/drive/My Drive/checkpoint.txt", monitor='val_accuracy')
early_stopping = EarlyStopping(monitor='val_accuracy', patience=5)
callbacks_list = [checkpoint, early_stopping]
# Training set
encoder_input_data = X_train
decoder_input_data = Y_train[:,:-1]
decoder_target_data = Y_train[:,1:]
# development set
encoder_input_test = X_test
decoder_input_test = Y_test[:,:-1]
decoder_target_test= Y_test[:,1:]
history = model.fit([encoder_input_data, decoder_input_data],decoder_target_data,
epochs=50,
batch_size=64,
validation_data = ([encoder_input_test, decoder_input_test],decoder_target_test),
callbacks= callbacks_list)
x_voc_size is 45701 and y_voc_size is 84213. There are approximately 45,000 records. I am getting a memory error while training this model on a machine with 35 GB of RAM. Even after reducing the batch size to 25, I get the same error. Please suggest how to resolve it.
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 5515)] 0
__________________________________________________________________________________________________
embedding (Embedding) (None, 5515, 256) 11699456 input_1[0][0]
__________________________________________________________________________________________________
input_2 (InputLayer) [(None, None)] 0
__________________________________________________________________________________________________
bidirectional (Bidirectional) [(None, 5515, 1000), 3028000 embedding[0][0]
__________________________________________________________________________________________________
embedding_1 (Embedding) (None, None, 256) 21558528 input_2[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate) (None, 1000) 0 bidirectional[0][1]
bidirectional[0][3]
__________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 1000) 0 bidirectional[0][2]
bidirectional[0][4]
__________________________________________________________________________________________________
lstm_1 (LSTM) [(None, None, 1000), 5028000 embedding_1[0][0]
concatenate[0][0]
concatenate_1[0][0]
__________________________________________________________________________________________________
attention_layer (AttentionLayer ((None, None, 1000), 2001000 bidirectional[0][0]
lstm_1[0][0]
__________________________________________________________________________________________________
concat_layer (Concatenate) (None, None, 2000) 0 lstm_1[0][0]
attention_layer[0][0]
__________________________________________________________________________________________________
time_distributed (TimeDistribut (None, None, 84213) 168510213 concat_layer[0][0]
==================================================================================================
Total params: 211,825,197
Trainable params: 211,825,197
Non-trainable params: 0
__________________________________________________________________________________________________
EDIT - This is the model's summary. I think the number of parameters is huge. How can I efficiently reduce the complexity of the model?
That's quite a large model, and I bet that if we talk it through we can find something suitable for your use case. To give you an idea of where you stand: you would really want a cloud TPU cluster for something that big. I've been through most of the deeplearning.ai specializations, and they choose a cloud TPU cluster for models with between 5,000,000 and 13,000,000 parameters. The model you have there is something that would really want to be trained in a bigger corporate data center or a national lab environment. In lieu of that, it would be well worth checking out transfer learning, since a large number of great models have already been trained in such environments and you could build on them for free. I'd say that if you can bring the number of trainable parameters down to something like 3,000,000, you will find the model much more manageable on your hardware. Please, let's turn this into a conversation so everyone gets to learn. Let me know your thoughts!
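To see where the parameters in the summary above come from, here is a rough breakdown in plain Python. The formulas are the standard parameter counts for Embedding, LSTM and Dense layers; the helper names are made up for this sketch, and the count for the custom AttentionLayer is copied from the summary rather than derived, since its code is not shown.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return (input_dim + 1) * units


latent_dim, embedding_dim = 500, 256
x_voc_size, y_voc_size = 45701, 84213

breakdown = {
    "encoder embedding": embedding_params(x_voc_size, embedding_dim),
    "bidirectional LSTM": 2 * lstm_params(embedding_dim, latent_dim),
    "decoder embedding": embedding_params(y_voc_size, embedding_dim),
    "decoder LSTM": lstm_params(embedding_dim, 2 * latent_dim),
    "attention layer": 2_001_000,  # copied from the summary, not derived here
    "output Dense": dense_params(4 * latent_dim, y_voc_size),
}
total = sum(breakdown.values())

for name, count in breakdown.items():
    print(f"{name:20s} {count:>12,d} {count / total:6.1%}")
print(f"{'total':20s} {total:>12,d}")
```

The total reproduces the 211,825,197 from the summary, and roughly 80% of it sits in the final Dense layer over the 84213-word target vocabulary. Under that reading, shrinking the target vocabulary or the width of the tensor fed into that layer would probably reduce the parameter count far more than changing the LSTM sizes.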

How to fix ValueError: Input 0 is incompatible with layer CNN: expected shape=(None, 35), found shape=(None, 31)

I am using a convolutional neural network (Keras, Conv1D) for a text classification task. When I run the model below on my multi-class text classification task, I get the error shown further down. I spent time trying to understand the error, but I don't know how to fix it. Can anyone help me, please?
The shapes of the training and validation sets are as follows:
df_train shape: (7198,)
df_val shape: (1800,)
np.random.seed(42)
# You need to reshape your input data according to the Conv1D layer input format: (batch_size, steps, input_dim).
# set parameters of matrices and convolution
embedding_dim = 300
nb_filter = 64
filter_length = 5
hidden_dims = 32
stride_length = 1
from keras.layers import Embedding
embedding_layer = Embedding(len(tokenizer.word_index) + 1,
embedding_dim,
input_length=35,
name="Embedding")
inp = Input(shape=(35,), dtype='int32')
embeddings = embedding_layer(inp)
conv1 = Conv1D(filters=32, # Number of filters to use
kernel_size=filter_length, # n-gram range of each filter.
padding='same', #valid: don't go off edge; same: use padding before applying filter
activation='relu',
name="CONV1",
kernel_regularizer=regularizers.l2(l=0.0367))(embeddings)
conv2 = Conv1D(filters=32, # Number of filters to use
kernel_size=filter_length, # n-gram range of each filter.
padding='same', #valid: don't go off edge; same: use padding before applying filter
activation='relu',
name="CONV2",kernel_regularizer=regularizers.l2(l=0.02))(embeddings)
conv3 = Conv1D(filters=32, # Number of filters to use
kernel_size=filter_length, # n-gram range of each filter.
padding='same', #valid: don't go off edge; same: use padding before applying filter
activation='relu',
name="CONV2",kernel_regularizer=regularizers.l2(l=0.01))(embeddings)
max1 = MaxPool1D(10, strides=1,name="MaxPool1D1")(conv1)
max2 = MaxPool1D(10, strides=1,name="MaxPool1D2")(conv2)
max3 = MaxPool1D(10, strides=1,name="MaxPool1D2")(conv3)
conc = concatenate([max1, max2,max3])
flat = Flatten(name="FLATTEN")(max1)
....
The error is as follows:
ValueError: Input 0 is incompatible with layer CNN: expected shape=(None, 35), found shape=(None, 31)
The model:
Model: "CNN"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_19 (InputLayer) [(None, 35)] 0
_________________________________________________________________
Embedding (Embedding) (None, 35, 300) 4094700
_________________________________________________________________
CONV1 (Conv1D) (None, 35, 32) 48032
_________________________________________________________________
MaxPool1D1 (MaxPooling1D) (None, 26, 32) 0
_________________________________________________________________
FLATTEN (Flatten) (None, 832) 0
_________________________________________________________________
Dropout (Dropout) (None, 832) 0
_________________________________________________________________
Dense (Dense) (None, 3) 2499
=================================================================
Total params: 4,145,231
Trainable params: 4,145,231
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
This error occurs when the shape of the network's input layer does not match the shape of the dataset. If you receive an error like this, try one of the following:
Set the network's input shape to (None, 31), i.e. Input(shape=(31,)), so that it matches the dataset's shape.
Make sure the dataset's shape is (num_of_examples, 35), for example by padding or truncating every sequence to 35 tokens (preferable).
If all of this is correct and there is no problem with the dataset, the error might be in the network itself, where the shapes of two adjacent layers don't match.
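For the preferable option, every sequence has to end up with exactly 35 tokens before it reaches the model. Below is a minimal sketch in plain Python; pad_to_length is a hypothetical helper written for this example, and the token ids are made up. In Keras, pad_sequences(sequences, maxlen=35) serves the same purpose, and to my knowledge it also pads and truncates at the front by default.

```python
def pad_to_length(sequences, maxlen, value=0):
    """Pad with `value` at the front, or keep only the last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # truncate long rows
        padded.append([value] * (maxlen - len(seq)) + seq)  # left-pad short rows
    return padded


max_len = 35  # must equal the shape given to Input(shape=(35,))

# One short row and one 40-token row for training, and a 31-token row for
# validation, like the one reported in the error message.
train = pad_to_length([[5, 8, 2], list(range(1, 41))], max_len)
val = pad_to_length([[7] * 31], max_len)

print([len(row) for row in train + val])  # [35, 35, 35]
```

Whatever length is chosen, the same value has to be used for the training data, the validation data and the Input layer.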

Training the same model with two different outputs with Keras

I have a simple GRU network coded with Keras in python as below:
gru1 = GRU(16, activation='tanh', return_sequences=True)(input)
dense = TimeDistributed(Dense(16, activation='tanh'))(gru1)
output = TimeDistributed(Dense(1, activation="sigmoid"))(dense)
I've used a sigmoid activation for the output since my purpose is classification. But I need to use the same model for regression as well, which means changing the output activation to linear while the rest of the network stays the same. As it stands, I would need two different networks for the two purposes. The inputs are the same, but the outputs are classes for the sigmoid activation and values for the linear activation.
My question is, is there any way to use only one network but get two different outputs at the end? Thanks.
Yes, you can use functional API to design a multi-output model. You can keep shared layers and 2 different outputs one with sigmoid another with linear activation.
N.B.: Don't use input as a variable name; it shadows Python's built-in input() function.
from tensorflow.keras.layers import Input, GRU, Dense, TimeDistributed
from tensorflow.keras.models import Model
seq_len = 100 # your sequence length
input_ = Input(shape=(seq_len,1))
gru1 = GRU(16, activation='tanh', return_sequences=True)(input_)
dense = TimeDistributed(Dense(16, activation='tanh'))(gru1)
output1 = TimeDistributed(Dense(1, activation="sigmoid", name="out1"))(dense)
output2 = TimeDistributed(Dense(1, activation="linear", name="out2"))(dense)
model = Model(input_, [output1, output2])
model.summary()
Model: "model_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_3 (InputLayer) [(None, 100, 1)] 0
__________________________________________________________________________________________________
gru_2 (GRU) (None, 100, 16) 912 input_3[0][0]
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, 100, 16) 272 gru_2[0][0]
__________________________________________________________________________________________________
time_distributed_4 (TimeDistrib (None, 100, 1) 17 time_distributed_3[0][0]
__________________________________________________________________________________________________
time_distributed_5 (TimeDistrib (None, 100, 1) 17 time_distributed_3[0][0]
==================================================================================================
Total params: 1,218
Trainable params: 1,218
Non-trainable params: 0
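Compiling with two loss functions. The dictionary keys must match the names of the model's output layers, so for the keys below to work, name="out1" and name="out2" have to be passed to the TimeDistributed wrappers rather than to the inner Dense layers (as written above, the outputs are named time_distributed_4 and time_distributed_5, as the summary shows):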
losses = {
"out1": "binary_crossentropy",
"out2": "mse",
}
# initialize the optimizer and compile the model
model.compile(optimizer='adam', loss=losses, metrics=["accuracy", "mae"])

In Keras elmo embedding layer has 0 parameters? is this normal?

I was using GloVe with my model and it worked, but now I have switched to ELMo (based on the Keras code available on GitHub: Elmo Keras Github, utils.py).
However, when I print model.summary() I get 0 parameters in the ELMo embedding layer, unlike when I was using GloVe. Is that normal? If not, can you please tell me what I am doing wrong?
Using GloVe I got over 20 million parameters.
##--------> When I was using Glove Embedding Layer
word_embedding_layer = emb.get_keras_embedding(#dropout = emb_dropout,
trainable = True,
input_length = sent_maxlen,
name='word_embedding_layer')
## --------> Deep layers
pos_embedding_layer = Embedding(output_dim =pos_tag_embedding_size, #5
input_dim = len(SPACY_POS_TAGS),
input_length = sent_maxlen, #20
name='pos_embedding_layer')
latent_layers = stack_latent_layers(num_of_latent_layers)
##--------> 6] Dropout
dropout = Dropout(0.1)
## --------> 7]Prediction
predict_layer = predict_classes()
## --------> 8] Prepare input features, and indicate how to embed them
inputs = [Input((sent_maxlen,), dtype='int32', name='word_inputs'),
Input((sent_maxlen,), dtype='int32', name='predicate_inputs'),
Input((sent_maxlen,), dtype='int32', name='postags_inputs')]
## --------> 9] ELMo Embedding and Concat all inputs and run on deep network
from elmo import ELMoEmbedding
import utils
idx2word = utils.get_idx2word()
ELmoembedding1 = ELMoEmbedding(idx2word=idx2word, output_mode="elmo", trainable=True)(inputs[0]) # These two are interchangeable
ELmoembedding2 = ELMoEmbedding(idx2word=idx2word, output_mode="elmo", trainable=True)(inputs[1]) # These two are interchangeable
embeddings = [ELmoembedding1,
ELmoembedding2,
pos_embedding_layer(inputs[2])]  # inputs has three entries, so the POS tags are at index 2
con1 = keras.layers.concatenate(embeddings)
## --------> 10]Build model
outputI = predict_layer(dropout(latent_layers(con1)))
model = Model(inputs, outputI)
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['categorical_accuracy'])
model.summary()
Trials:
Note: I tried using the TF-Hub ELMo with Keras code, but the output was always a 2D tensor (even when I changed it to the 'Elmo' setting and 'LSTM' instead of the default), so I couldn't concatenate it with POS_embedding_layer. I tried reshaping, but eventually I got the same issue: 0 total parameters.
From the TF-Hub description (https://tfhub.dev/google/elmo/2), the embeddings of individual words are not trainable. Only the weights used in the weighted sum of the embedding and LSTM layers are, so you should get 4 trainable parameters at the ELMo level.
I was able to get the trainable parameters using the class defined in StrongIO's example on GitHub. The example only provides a class where the output is the default layer, which is a 1024-dimensional vector for each input example (essentially a document/sentence encoder). To access the embeddings of each word (the elmo layer), a few changes are needed, as suggested in this issue:
class ElmoEmbeddingLayer(Layer):
    def __init__(self, **kwargs):
        self.dimensions = 1024
        self.trainable = True
        super(ElmoEmbeddingLayer, self).__init__(**kwargs)
    def build(self, input_shape):
        self.elmo = hub.Module('https://tfhub.dev/google/elmo/2', trainable=self.trainable,
                               name="{}_module".format(self.name))
        self.trainable_weights += K.tf.trainable_variables(scope="^{}_module/.*".format(self.name))
        super(ElmoEmbeddingLayer, self).build(input_shape)
    def call(self, x, mask=None):
        result = self.elmo(
            K.squeeze(
                K.cast(x, tf.string), axis=1
            ),
            as_dict=True,
            signature='default',
        )['elmo']
        return result
    def compute_output_shape(self, input_shape):
        return (input_shape[0], None, self.dimensions)
You can stack the ElmoEmbeddingLayer with the POS layer.
As a more general example, one can use the ELMo embeddings in a 1D ConvNet model for classification:
elmo_input_layer = Input(shape=(None, ), dtype="string")
elmo_output_layer = ElmoEmbeddingLayer()(elmo_input_layer)
conv_layer = Conv1D(
filters=100,
kernel_size=3,
padding='valid',
activation='relu',
strides=1)(elmo_output_layer)
pool_layer = GlobalMaxPooling1D()(conv_layer)
dense_layer = Dense(32)(pool_layer)
output_layer = Dense(1, activation='sigmoid')(dense_layer)
model = Model(
inputs=elmo_input_layer,
outputs=output_layer)
model.summary()
The model summary looks like this:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_62 (InputLayer) (None, None) 0
_________________________________________________________________
elmo_embedding_layer_13 (Elm (None, None, 1024) 4
_________________________________________________________________
conv1d_46 (Conv1D) (None, None, 100) 307300
_________________________________________________________________
global_max_pooling1d_42 (Glo (None, 100) 0
_________________________________________________________________
dense_53 (Dense) (None, 32) 3232
_________________________________________________________________
dense_54 (Dense) (None, 1) 33
=================================================================
Total params: 310,569
Trainable params: 310,569
Non-trainable params: 0
_________________________________________________________________

How to calculate the trainable param quantity to be 335872 in this LSTM sample code?

I got this sample code but can't figure out how the number of trainable parameters comes out to 335872 (shown in the output below).
I would appreciate it if anyone could help on this question. Thanks!
-------------------------code------------------------------------
input_shape = (None, num_encoder_tokens)
# Define an input sequence and process it.
encoder_inputs = Input(shape=input_shape)
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary(line_length=100)
encoder_model.output_shape
---------------------output is as follows----------------------
Layer (type) Output Shape Param #
=================================================================================
input_2 (InputLayer) (None, None, 71) 0
_________________________________________________________________________________
lstm_5 (LSTM) [(None, 256), (None, 256), (None, 256)] 335872
=================================================================================
Total params: 335,872
Trainable params: 335,872
Non-trainable params: 0
_________________________________________________________________________________
[(None, 256), (None, 256)]
I am assuming that you want to know how to train the model so that the weight matrices, biases, etc. can be calculated.
The problem with your code is that you have only defined the architecture of your model; you haven't compiled it. Do this at the end:
encoder_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])
In the line above, the loss, optimizer and metrics are up to you to choose, depending on the type of your problem.
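As for the number 335872 itself, it can be reproduced from the shapes in the summary. A standard LSTM layer has four gates, and each gate has a weight matrix for the input, a weight matrix for the previous hidden state and a bias vector. Here is a minimal sketch of that arithmetic, assuming the default LSTM with biases enabled; the function name is made up for the example.

```python
def lstm_param_count(input_dim, units):
    """Parameters of a standard LSTM layer with biases."""
    input_kernel = input_dim * units   # weights applied to the input x_t
    recurrent_kernel = units * units   # weights applied to the state h_{t-1}
    bias = units
    return 4 * (input_kernel + recurrent_kernel + bias)


num_encoder_tokens = 71  # last dimension of the input in the summary
latent_dim = 256         # size of each state in the summary

print(lstm_param_count(num_encoder_tokens, latent_dim))  # 335872
```

Written out, that is 4 * (71 * 256 + 256 * 256 + 256) = 4 * 83968 = 335872. The count does not depend on the sequence length, since the same weights are reused at every time step.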
