Where is Keras LSTM Bias added at inference time?

The Keras LSTM implementation outputs kernel weights, recurrent weights and a single bias vector. I would have expected there to be a bias for both the kernel weights and the recurrent weights so I am trying to make sure that I understand where this bias is being applied. Consider the randomly initialized example:
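from keras.models import Sequential
from keras.layers import LSTM
test_model = Sequential()
test_model.add(LSTM(4, input_dim=5, input_length=10, return_sequences=True))
# Print each trainable weight variable alongside its current value and shape.
for e in zip(test_model.layers[0].trainable_weights, test_model.layers[0].get_weights()):
    print('Param %s:\n%s' % (e[0], e[1]))
    print(e[1].shape)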
This will print something like the following:
Param <tf.Variable 'lstm_3/kernel:0' shape=(5, 16) dtype=float32_ref>:
[[-0.46578053 -0.31746995 -0.33488223 0.4640277 -0.46431816 -0.0852727
0.43396038 0.12882692 -0.0822868 -0.23696694 0.4661569 0.4719978
0.12041456 -0.20120585 0.45095628 -0.1172519 ]
[ 0.04213512 -0.24420211 -0.33768272 0.11827284 -0.01744157 -0.09241
0.18402642 0.07530934 -0.28586367 -0.05161515 -0.18925312 -0.19212383
0.07093149 -0.14886391 -0.08835816 0.15116036]
[-0.09760407 -0.27473268 -0.29974532 -0.14995047 0.35970795 0.03962368
0.35579181 -0.21503082 -0.46921644 -0.47543833 -0.51497519 -0.08157375
0.4575423 0.35909468 -0.20627108 0.20574462]
[-0.19834137 0.05490702 0.13013887 -0.52255917 0.20565301 0.12259561
-0.33298236 0.2399289 -0.23061508 0.2385658 -0.08770937 -0.35886696
0.28242612 -0.49390298 -0.23676801 0.09713227]
[-0.21802655 -0.32708862 -0.2184104 -0.28524712 0.37784815 0.50567037
0.47393328 -0.05177036 0.41434419 -0.36551589 0.01406455 0.30521619
0.39916915 0.22952956 0.40699703 0.4528749 ]]
(5, 16)
Param <tf.Variable 'lstm_3/recurrent_kernel:0' shape=(4, 16) dtype=float32_ref>:
[[ 0.28626361 -0.21708137 -0.18340513 -0.02943563 -0.16822724 0.38830781
-0.50277489 -0.07898639 -0.30247116 -0.01375726 -0.34504923 -0.01373435
-0.32458451 -0.03497506 -0.01305341 0.28398186]
[-0.35822678 0.13861786 0.42913082 0.11312254 -0.1593778 0.58666271
0.09238213 -0.24134786 0.2196856 -0.01660753 -0.01929135 -0.02324873
-0.2000526 -0.07921806 -0.33966202 -0.08963238]
[-0.06521184 -0.28180376 0.00445012 -0.32302913 -0.02236169 -0.00901215
0.03330055 0.10727262 0.03839845 -0.58494729 0.36934188 -0.31894827
-0.43042961 0.01130622 0.11946538 -0.13160609]
[-0.31211731 -0.24986106 0.16157174 -0.27083701 0.14389414 -0.23260537
-0.28311059 -0.17966864 -0.28650531 -0.06572254 -0.03313115 0.23230191
0.13236329 0.44721091 -0.42978323 -0.09875761]]
(4, 16)
Param <tf.Variable 'lstm_3/bias:0' shape=(16,) dtype=float32_ref>:
[ 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
(16,)
I understand that the kernel weights are used for the linear transformation of the inputs, so they have shape [input_dim, 4 * hidden_units], in this case [5, 16], and that the recurrent kernel weights are used for the linear transformation of the recurrent state, so they have shape [hidden_units, 4 * hidden_units]. The bias, on the other hand, has shape [4 * hidden_units], so it seems conceivable that it is added to the recurrent transformation but not to the input transformation. This example suggests that the bias, as it is output here, can only be added to the recurrent transformation:
embedding_dim = 5
hidden_units = 4
test_embedding = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
kernel_weights = test_model.layers[0].get_weights()[0]
recurrent_weights = test_model.layers[0].get_weights()[1]
bias = test_model.layers[0].get_weights()[2]
initial_state = np.zeros((hidden_units, 1))
input_transformation = np.dot(np.transpose(kernel_weights), test_embedding[0]) # + bias or + np.transpose(bias) won't work
recurrent_transformation = np.dot(np.transpose(recurrent_weights), initial_state) + bias
print(input_transformation.shape)
print(recurrent_transformation.shape)
Looking at this blog post there are biases added at pretty much every step, so I'm still feeling pretty lost as to where this bias is being applied.
Can anybody help me clarify where the LSTM bias is being added?

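The bias is added once, to the sum of the two matrix products. It doesn't matter whether you think of it as added to the input term after its matmul or to the recurrent term after its matmul, because addition is commutative and associative: each gate's pre-activation has the form x_t W + h_{t-1} U + b, with a single bias vector b shared by both terms.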
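The point about the order of addition can be checked numerically. The following is a minimal NumPy sketch, not the Keras implementation: the weights are random stand-ins with the shapes from the question, and the names W, U, b, x_t and h_prev are chosen here for illustration. The bias mimics the printed one, with ones in the second block of four entries.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_units = 5, 4

# Random stand-ins for kernel (5, 16) and recurrent kernel (4, 16).
W = rng.normal(size=(input_dim, 4 * hidden_units))
U = rng.normal(size=(hidden_units, 4 * hidden_units))

# One bias vector of length 4 * hidden_units, ones in the second block.
b = np.zeros(4 * hidden_units)
b[hidden_units:2 * hidden_units] = 1.0

# Row-vector convention: one sample with 5 features, zero initial state.
x_t = np.array([[0.1, 0.2, 0.3, 0.4, 0.5]])
h_prev = np.zeros((1, hidden_units))

# Grouping the bias with either term gives the same pre-activation.
z_input_side = (x_t @ W + b) + h_prev @ U
z_recurrent_side = x_t @ W + (h_prev @ U + b)
z = x_t @ W + h_prev @ U + b

print(z.shape)
print(np.allclose(z_input_side, z_recurrent_side))
```

Note that the whole input vector is multiplied by the kernel here. In the question's snippet, test_embedding[0] selects a single scalar, which is why the shapes there do not line up with the bias.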

Related

Using Dropout on output of embedding layer changes array values, Why?

Comparing the outputs of an embedding layer with and without dropout shows that some values in the arrays are replaced with 0. But why do the other values in the array change as well?
Here is my model:
input = Input(shape=(23,))
model = Embedding(input_dim=n_words, output_dim=23, input_length=23)(input)
model = Dropout(0.2)(model)
model = Bidirectional(LSTM(units=LSTM_N, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model) # softmax output layer
model = Model(input, out)
I build model2 from the trained model, with the model's input layer as its input and the output of Dropout(0.2) as its output:
from keras import backend as K
model2 = K.function([model.layers[0].input , K.learning_phase()],
[model.layers[2].output] )
dropout = model2([X_train[0:1] , 1])[0]
nodrop = model2([X_train[0:1] , 0])[0]
Printing the first array of both dropout and no dropout:
dropout[0][0]
Output-
array([ 0. , -0. , -0. , -0.04656423, -0. ,
0.28391626, 0.12213208, -0.01187495, -0.02078421, -0. ,
0.10585815, -0. , 0.27178472, -0.21080771, 0. ,
-0.09336889, 0.07441022, 0.02960865, -0.2755439 , -0.11252255,
-0.04330419, -0. , 0.04974075], dtype=float32)
nodrop[0][0]
Output-
array([ 0.09657606, -0.06267098, -0.00049554, -0.03725138, -0.11286845,
0.22713302, 0.09770566, -0.00949996, -0.01662737, -0.05788678,
0.08468652, -0.22405024, 0.21742778, -0.16864617, 0.08558936,
-0.07469511, 0.05952817, 0.02368692, -0.22043513, -0.09001804,
-0.03464335, -0.05152775, 0.0397926 ], dtype=float32)
Some values are replaced with 0, as expected, but why are the other values changed?
The embedding outputs carry meaning and are unique to each word. If applying dropout changes them, is it correct to apply dropout after the embedding layer?
Note: I have used a learning_phase of 0 for testing (no dropout) and 1 for training (dropout).

This is how dropout regularization works. After dropout is applied, the remaining values are divided by the keep probability (in this case 0.8).
When you use dropout, the function receives the probability of setting a neuron to zero as input, e.g. 0.2, which means each neuron has a 0.8 chance of being kept. The values that remain are therefore multiplied by 1/(1-0.2) = 1.25.
This is called the "inverted dropout" technique, and it ensures that the expected value of the activation stays the same. Otherwise, predictions would be on a different scale during inference, when dropout is not used.
Your dropout rate is 0.2, and all the values that survived dropout have been divided by 0.8.
Look what happens if I divide your second output (no dropout) by the first (dropout):
import numpy as np
a = np.array([ 0. , -0. , -0. , -0.04656423, -0. ,
0.28391626, 0.12213208, -0.01187495, -0.02078421, -0. ,
0.10585815, -0. , 0.27178472, -0.21080771, 0. ,
-0.09336889, 0.07441022, 0.02960865, -0.2755439 , -0.11252255,
-0.04330419, -0. , 0.04974075])
b = np.array([ 0.09657606, -0.06267098, -0.00049554, -0.03725138, -0.11286845,
0.22713302, 0.09770566, -0.00949996, -0.01662737, -0.05788678,
0.08468652, -0.22405024, 0.21742778, -0.16864617, 0.08558936,
-0.07469511, 0.05952817, 0.02368692, -0.22043513, -0.09001804,
-0.03464335, -0.05152775, 0.0397926 ])
print(b/a)
[ inf inf inf 0.79999991 inf 0.80000004
0.79999997 0.8 0.8000001 inf 0.8 inf
0.80000001 0.80000001 inf 0.79999998 0.79999992 0.8
0.80000004 0.8 0.79999995 inf 0.8 ]
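The scaling described in this answer can be sketched in a few lines of NumPy. This is an illustration of inverted dropout under stated assumptions, not the Keras source: the function name inverted_dropout and the random generator are introduced here, and the input reuses the first five no-dropout values from the question.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each entry with probability `rate`; scale survivors by 1/(1-rate)."""
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob  # True where the entry is kept
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.array([0.09657606, -0.06267098, -0.00049554, -0.03725138, -0.11286845])
dropped = inverted_dropout(x, rate=0.2, rng=rng)

kept = dropped != 0  # valid here because no input entry is exactly zero
print(dropped)
print(x[kept] / dropped[kept])  # ratio is 0.8 for every surviving entry

# Averaged over many entries, the mean activation is roughly preserved.
ones = np.ones(100_000)
print(inverted_dropout(ones, rate=0.2, rng=rng).mean())
```

The final line shows why the rescaling is done: without the division by the keep probability, the mean would drop to about 0.8 during training and the next layer would see a different scale at inference.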

Could not understand keras dense layer's output

I am testing a Keras layer. I have built a simple Dense layer whose input has shape (10, 2) with all values equal to 1, and I use zero_initial_state to initialize the layer weights. However, I cannot understand the output of the Dense layer; it seems to compute the final output from something unknown. My code is:
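import tensorflow as tf
from tensorflow.keras.layers import Dense  # assumed import; the original snippet does not show it
batch_size = 10
time_steps = 30
label_num = 2  # an integer, since it is used as a shape dimension
units = 5
batch_data = tf.ones((batch_size, label_num))
dense_layer = Dense(units)
output = dense_layer(batch_data)
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    print('__________________output_____________________')
    print(sess.run(output))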
I print the initial kernel and bias:
____________________self.kernel____________________
[[-0.6072792 0.87520194 -0.5916964 -0.28233814 0.37042332]
[ 0.24503589 -0.8950937 -0.7122175 0.67322683 0.9035703 ]]
____________________self.bias____________________
[0. 0. 0. 0. 0.]
I think the final output should be:
[[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
....
However, the final output is:
[[-0.25280607 1.0728977 -0.6096982 1.1957564 0.82103825]
[-0.25280607 1.0728977 -0.6096982 1.1957564 0.82103825]
[-0.25280607 1.0728977 -0.6096982 1.1957564 0.82103825]
The activation is None. Why is the output of the Keras Dense layer like this?
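The expected values in the question follow from output = input · kernel + bias with no activation. A small NumPy check, using the kernel and bias exactly as printed above (the variable names are chosen here for illustration):

```python
import numpy as np

# Kernel and bias copied from the printout in the question.
kernel = np.array([[-0.6072792, 0.87520194, -0.5916964, -0.28233814, 0.37042332],
                   [0.24503589, -0.8950937, -0.7122175, 0.67322683, 0.9035703]])
bias = np.zeros(5)

# A batch of 10 rows, each [1, 1], as in the question.
batch_data = np.ones((10, 2))

# Dense layer with activation=None: a plain affine map.
expected = batch_data @ kernel + bias
print(expected[0])
```

With an all-ones input, each output entry is just the column sum of the kernel, which reproduces the values the asker expected. An output that differs from this would therefore have to come from a kernel other than the one printed; whether that is what happened here cannot be determined from the snippet shown.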

Creating softmax from a tf.distributions.Categorical output layer

I'm training an agent to act in a discrete environment, and I'm using a tf.distributions.Categorical output layer which I then sample to create a softmax output to determine what action to take. I create my policy network like this:
pi_eval, _ = self._build_anet(self.state, 'pi', reuse=True)
def _build_anet(self, state_in, name, reuse=False):
    w_reg = tf.contrib.layers.l2_regularizer(L2_REG)
    with tf.variable_scope(name, reuse=reuse):
        layer_1 = tf.layers.dense(state_in, HIDDEN_LAYER_NEURONS, tf.nn.relu, kernel_regularizer=w_reg, name="pi_l1")
        layer_2 = tf.layers.dense(layer_1, HIDDEN_LAYER_NEURONS, tf.nn.relu, kernel_regularizer=w_reg, name="pi_l2")
        a_logits = tf.layers.dense(layer_2, self.a_dim, kernel_regularizer=w_reg, name="pi_logits")
        dist = tf.distributions.Categorical(logits=a_logits)
    params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)
    return dist, params
I then sample the network and build up a class distribution output to act as a softmax output, using the example from the tf.distributions.Categorical webpage:
n = 1e4
self.logits_action = tf.cast(tf.histogram_fixed_width(values=pi_eval.sample(int(n)), value_range=[0, 1], nbins=self.a_dim), dtype=tf.float32) / n
Run like this:
softmax = self.sess.run([self.logits_action], {self.state: state[np.newaxis, :]})
But the outputs only ever have two non-zero entries:
[0.44329998 0. 0. 0.5567 ]
[0.92139995 0. 0. 0.0786 ]
[0.95699996 0. 0. 0.043 ]
[0.7051 0. 0. 0.2949]
My hunch is that it has something to do with value_range, about which the documentation says:
value_range: Shape [2] Tensor of same dtype as values. values <= value_range[0] will be mapped to hist[0], values >= value_range[1] will be mapped to hist[-1].
But I'm not sure what value range I should use. Does anyone have any ideas?

Indeed, as I suspected, it was something to do with value_range: I should set the upper bound to the action dimension:
value_range=[0, self.a_dim]
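Why the narrower range leaves only two non-zero bins can be seen with a small NumPy emulation of fixed-width binning. This is a sketch written for illustration, following the documented behaviour quoted in the question (values at or below the lower bound go to the first bin, values at or above the upper bound go to the last); it is not the TensorFlow implementation, and the sample actions are made up.

```python
import numpy as np

def histogram_fixed_width(values, value_range, nbins):
    """Count values in nbins equal-width bins, clamping out-of-range values."""
    lo, hi = value_range
    scaled = (np.asarray(values, dtype=np.float64) - lo) / (hi - lo)
    idx = np.floor(scaled * nbins).astype(int)
    idx = np.clip(idx, 0, nbins - 1)  # out-of-range values land in the edge bins
    return np.bincount(idx, minlength=nbins)

a_dim = 4
samples = np.array([0, 1, 2, 3, 3, 0, 1, 2])  # made-up sampled action indices

# Range [0, 1]: every action >= 1 is clamped into the last bin.
print(histogram_fixed_width(samples, [0, 1], a_dim))

# Range [0, a_dim]: each bin has width 1, so each action gets its own bin.
print(histogram_fixed_width(samples, [0, a_dim], a_dim))
```

With value_range=[0, 1], action 0 falls in the first bin and actions 1, 2 and 3 all collapse into the last one, which matches the pattern of outputs shown in the question.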

Tensorflow predict the class of output

I have tried the example with Keras, but not with an LSTM. My model uses an LSTM in TensorFlow, and I want to predict the output as classes, the way the Keras model does with predict_classes.
The Tensorflow model I am trying is something like this:
seq_len=10
n_steps = seq_len-1
n_inputs = x_train.shape[2]
n_neurons = 50
n_outputs = y_train.shape[1]
n_layers = 2
learning_rate = 0.0001
batch_size =100
n_epochs = 1000
train_set_size = x_train.shape[0]
test_set_size = x_test.shape[0]
tf.reset_default_graph()
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_outputs])
layers = [tf.contrib.rnn.LSTMCell(num_units=n_neurons,activation=tf.nn.sigmoid, use_peepholes = True) for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
stacked_outputs = tf.layers.dense(stacked_rnn_outputs, n_outputs)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])
outputs = outputs[:,n_steps-1,:]
loss = tf.reduce_mean(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
I am encoding the labels with sklearn's LabelEncoder as:
encoder_train = LabelEncoder()
encoder_train.fit(y_train)
encoded_Y_train = encoder_train.transform(y_train)
y_train = np_utils.to_categorical(encoded_Y_train)
The labels are thereby converted to a one-hot (binary) matrix.
When I tried to predict the output I got the following:
actual==> [[0. 0. 1.]
[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[1. 0. 0.]
[1. 0. 0.]
[1. 0. 0.]
[0. 1. 0.]
[0. 1. 0.]]
predicted==> [[0.3112209 0.3690182 0.31357136]
[0.31085992 0.36959863 0.31448898]
[0.31073445 0.3703295 0.31469804]
[0.31177694 0.37011752 0.3145326 ]
[0.31220382 0.3692756 0.31515726]
[0.31232828 0.36947766 0.3149037 ]
[0.31190437 0.36756667 0.31323162]
[0.31339088 0.36542615 0.310322 ]
[0.31598282 0.36328828 0.30711085]]
What I was expecting was the label, based on the encoding done, as the Keras model gives. See the following:
predictions = model.predict_classes(X_test, verbose=True)
print("REAL VALUES:",reverse_category(Y_test,axis=1))
print("PRED VALUES:",predictions)
print("REAL COLORS:")
print(encoder.inverse_transform(reverse_category(Y_test,axis=1)))
print("PREDICTED COLORS:")
print(encoder.inverse_transform(predictions))
The output is something like the following:
REAL VALUES: [1 1 1 ... 1 2 1]
PRED VALUES: [2 1 1 ... 1 2 2]
REAL COLORS:
['ball' 'ball' 'ball' ... 'ball' 'bat' 'ball']
PREDICTED COLORS:
['bat' 'ball' 'ball' ... 'ball' 'bat' 'bat']
Kindly let me know what I can do in the TensorFlow model to get the result in terms of the encoding done.
I am using Tensorflow 1.12.0 and Windows 10

You are trying to map the predicted class probabilities back to class labels. Each row in the list of output predictions contains the three predicted class probabilities. Use np.argmax to obtain the one with the highest predicted probability in order to map to the predicted class label:
import numpy as np
predictions = [[0.3112209, 0.3690182, 0.31357136],
[0.31085992, 0.36959863, 0.31448898],
[0.31073445, 0.3703295, 0.31469804],
[0.31177694, 0.37011752, 0.3145326 ],
[0.31220382, 0.3692756, 0.31515726],
[0.31232828, 0.36947766, 0.3149037 ],
[0.31190437, 0.36756667, 0.31323162],
[0.31339088, 0.36542615, 0.310322 ],
[0.31598282, 0.36328828, 0.30711085]]
np.argmax(predictions, axis=1)
Gives:
array([1, 1, 1, 1, 1, 1, 1, 1, 1])
In this case, class 1 is predicted 9 times.
As noted in the comments: this is exactly what Keras does under the hood, as you'll see in the source code.
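To get from the argmax indices back to the original labels, the indices can be looked up in the encoder's sorted class list. The sketch below uses a plain NumPy array as a stand-in for encoder_train.classes_; the three class names are hypothetical placeholders (the question only shows 'ball' and 'bat'), and only two of the predicted rows are reproduced.

```python
import numpy as np

# Hypothetical class names standing in for encoder_train.classes_,
# which LabelEncoder stores in sorted order.
classes = np.array(['ball', 'bat', 'glove'])

# Two rows of predicted class probabilities from the question.
probabilities = np.array([[0.3112209, 0.3690182, 0.31357136],
                          [0.31598282, 0.36328828, 0.30711085]])

# Index of the most probable class per row, then look up its label.
class_indices = np.argmax(probabilities, axis=1)
labels = classes[class_indices]
print(class_indices, labels)
```

With the real encoder, encoder_train.inverse_transform(class_indices) performs the same lookup. Inside the TensorFlow graph, tf.argmax(outputs, axis=1) would yield the indices directly.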

Keras LSTM layer output and the output of a numpy LSTM implementation are similar but not same with the same weights and Input

I built a two-layer LSTM Keras model and compared the output of the first LSTM layer with my simple Python implementation of the LSTM layer, feeding in the same weights and inputs. The results for the first sequence of a batch are similar but not identical, and from the second sequence onward the results deviate further and further.
Below is my Keras model:
To compare against the Keras model, I first created an intermediate model that outputs the result of the first layer. print(intermediate_output[0,0]) gives the output for the first sequence, print(intermediate_output[0][1]) for the second sequence of the same batch, and print(intermediate_output[0][127]) for the last sequence.
inputs = Input(shape=(128,9))
f1=LSTM((n_hidden),return_sequences=True,name='lstm1')(inputs)
f2=LSTM((n_hidden), return_sequences=False,name='lstm2')(f1)
fc=Dense(6,activation='softmax',kernel_regularizer=regularizers.l2(lambda_loss_amount),name='fc')(f2)
model2 = Model(inputs=inputs, outputs=fc)
layer_name = 'lstm1'
intermediate_layer_model = Model(inputs=model2.input,
outputs=model2.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict(X_single_sequence[0,:,:])
print(intermediate_output[0,0]) # takes input[0][9]
print(intermediate_output[0][1]) # takes input[1][9] and hidden layer output of intermediate_output[0,0]
print(intermediate_output[0][127])
Re-implemented first layer of the same model:
I defined an LSTMlayer function that performs the same computation. weightLSTM holds the saved weights, x_t is the same input sequence, and h_t carries the output forward to the next sequence. intermediate_out is the function corresponding to the LSTM layer.
def sigmoid(x):
    return(1.0/(1.0+np.exp(-x)))
def LSTMlayer(warr,uarr, barr,x_t,h_tm1,c_tm1):
    '''
    c_tm1 = np.array([0,0]).reshape(1,2)
    h_tm1 = np.array([0,0]).reshape(1,2)
    x_t = np.array([1]).reshape(1,1)
    warr.shape = (nfeature,hunits*4)
    uarr.shape = (hunits,hunits*4)
    barr.shape = (hunits*4,)
    '''
    s_t = (x_t.dot(warr) + h_tm1.dot(uarr) + barr)
    hunit = uarr.shape[0]
    i = sigmoid(s_t[:,:hunit])
    f = sigmoid(s_t[:,1*hunit:2*hunit])
    _c = np.tanh(s_t[:,2*hunit:3*hunit])
    o = sigmoid(s_t[:,3*hunit:])
    c_t = i*_c + f*c_tm1
    h_t = o*np.tanh(c_t)
    return(h_t,c_t)
weightLSTM = model2.layers[1].get_weights()
warr,uarr, barr = weightLSTM
warr.shape,uarr.shape,barr.shape
def intermediate_out(n,warr,uarr,barr,X_test):
    for i in range(0, n+1):
        if i==0:
            c_tm1 = np.array([0]*hunits, dtype=np.float32).reshape(1,32)
            h_tm1 = np.array([0]*hunits, dtype=np.float32).reshape(1,32)
            h_t,ct = LSTMlayer(warr,uarr, barr,X_test[0][0:1][0:9],h_tm1,c_tm1)
        else:
            h_t,ct = LSTMlayer(warr,uarr, barr,X_test[0][i:i+1][0:9],h_t,ct)
    return h_t
# 1st sequence
ht0 = intermediate_out(0,warr,uarr,barr,X_test)
# 2nd sequence
ht1 = intermediate_out(1,warr,uarr,barr,X_test)
# 128th sequence
ht127 = intermediate_out(127,warr,uarr,barr,X_test)
The outputs of the keras LSTM layer from print(intermediate_output[0,0]) are as follows:
array([-0.05616369, -0.02299516, -0.00801201, 0.03872827, 0.07286803,
-0.0081161 , 0.05235862, -0.02240333, 0.0533984 , -0.08501752,
-0.04866522, 0.00254417, -0.05269946, 0.05809477, -0.08961852,
0.03975506, 0.00334282, -0.02813114, 0.01677909, -0.04411673,
-0.06751891, -0.02771493, -0.03293832, 0.04311397, -0.09430656,
-0.00269871, -0.07775293, -0.11201388, -0.08271968, -0.07464679,
-0.03533605, -0.0112953 ], dtype=float32)
and the outputs of my implementation from print(ht0) are:
array([[-0.05591469, -0.02280132, -0.00797964, 0.03681555, 0.06771626,
-0.00855897, 0.05160453, -0.02309707, 0.05746563, -0.08988875,
-0.05093143, 0.00264367, -0.05087904, 0.06033305, -0.0944235 ,
0.04066657, 0.00344291, -0.02881387, 0.01696692, -0.04101779,
-0.06718517, -0.02798996, -0.0346873 , 0.04402719, -0.10021093,
-0.00276826, -0.08390114, -0.1111543 , -0.08879325, -0.07953986,
-0.03261982, -0.01175724]], dtype=float32)
The outputs from print(intermediate_output[0][1]):
array([-0.13193817, -0.03231169, -0.02096735, 0.07571879, 0.12657365,
0.00067896, 0.09008797, -0.05597101, 0.09581321, -0.1696091 ,
-0.08893952, -0.0352162 , -0.07936387, 0.11100324, -0.19354928,
0.09691346, -0.0057206 , -0.03619875, 0.05680932, -0.08598096,
-0.13047703, -0.06360915, -0.05707538, 0.09686109, -0.18573627,
0.00711019, -0.1934243 , -0.21811798, -0.15629804, -0.17204499,
-0.07108577, 0.01727455], dtype=float32)
print(ht1):
array([[-1.34333193e-01, -3.36792655e-02, -2.06091907e-02,
7.15097040e-02, 1.18231244e-01, 7.98894180e-05,
9.03479978e-02, -5.85013032e-02, 1.06357656e-01,
-1.82848617e-01, -9.50253978e-02, -3.67032290e-02,
-7.70251378e-02, 1.16113290e-01, -2.08772928e-01,
9.89214852e-02, -5.82863577e-03, -3.79538871e-02,
6.01535551e-02, -7.99121782e-02, -1.31876275e-01,
-6.66067824e-02, -6.15542643e-02, 9.91254672e-02,
-2.00229391e-01, 7.51443207e-03, -2.13641390e-01,
-2.18286291e-01, -1.70858681e-01, -1.88928470e-01,
-6.49823472e-02, 1.72227081e-02]], dtype=float32)
print(intermediate_output[0][127]):
array([-0.46212202, 0.280646 , 0.514289 , -0.21109435, 0.53513926,
0.20116206, 0.24579187, 0.10773794, -0.6350403 , -0.0052841 ,
-0.15971565, 0.00309152, 0.04909453, 0.29789132, 0.24909772,
0.12323025, 0.15282209, 0.34281147, -0.2948742 , 0.03674917,
-0.22213924, 0.17646286, -0.12948939, 0.06568322, 0.04172657,
-0.28638166, -0.29086435, -0.6872528 , -0.12620741, 0.63395363,
-0.37212485, -0.6649531 ], dtype=float32)
print(ht127):
array([[-0.47431907, 0.29702517, 0.5428258 , -0.21381126, 0.6053808 ,
0.22849198, 0.25656056, 0.10378123, -0.6960949 , -0.09966939,
-0.20533416, -0.01677105, 0.02512029, 0.37508538, 0.35703233,
0.14703275, 0.24901289, 0.35873395, -0.32249793, 0.04093777,
-0.20691746, 0.20096642, -0.11741923, 0.06169611, 0.01019177,
-0.33316574, -0.08499744, -0.6748463 , -0.06659956, 0.71961826,
-0.4071832 , -0.6804066 ]], dtype=float32)
The outputs from (print(intermediate_output[0,0]), print(ht0)) and (print(intermediate_output[0][1]), print(ht1)) are similar, but the outputs from print(intermediate_output[0][127]) and print(ht127) are not the same, even though both algorithms run on the same GPU.
I read the Keras documentation, and it seems to me that I am not doing anything wrong. Please comment on this and let me know what I am missing here.
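One thing that may be worth checking, offered as an assumption rather than a diagnosis: in some Keras 2.x releases the LSTM layer's default recurrent_activation is hard_sigmoid, a piecewise-linear approximation, rather than the logistic sigmoid used in the NumPy code above. The question does not state the Keras version, so this may not apply. The sketch below only compares the two functions; hard_sigmoid is written out here from its usual definition, clip(0.2 * x + 0.5, 0, 1).

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid, as used in the NumPy re-implementation.
    return 1.0 / (1.0 + np.exp(-x))

def hard_sigmoid(x):
    # Piecewise-linear approximation: 0 below -2.5, 1 above 2.5,
    # and 0.2 * x + 0.5 in between.
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

x = np.linspace(-4.0, 4.0, 9)
gap = np.abs(sigmoid(x) - hard_sigmoid(x))
print(np.round(gap, 4))
```

Differences of a few hundredths per gate would be consistent with outputs that nearly agree at the first step and drift apart as the state is fed back over 128 steps. The activation actually in use can be read from model2.get_layer('lstm1').get_config()['recurrent_activation'].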
