Could not understand keras dense layer's output - python

I am testing a Keras layer. I have built a simple Dense layer whose input has shape (10, 2) with all values equal to 1, and I use zero_initial_state to initialize the layer weights. However, I cannot understand the output of the dense layer, since it seems to compute the final outputs from something unknown. My code is:
batch_size = 10
time_steps = 30
label_num = 2
units = 5
batch_data = tf.ones((batch_size, label_num))
dense_layer = Dense(units)
output = dense_layer(batch_data)
with tf.Session() as sess:
init = tf.global_variables_initializer()
sess.run(init)
print('__________________output_____________________')
print(sess.run(output))
I print the initial kernel and bias:
____________________self.kernel____________________
[[-0.6072792 0.87520194 -0.5916964 -0.28233814 0.37042332]
[ 0.24503589 -0.8950937 -0.7122175 0.67322683 0.9035703 ]]
____________________self.bias____________________
[0. 0. 0. 0. 0.]
I think the final output should be:
[[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
....
However, the final output is:
[[-0.25280607 1.0728977 -0.6096982 1.1957564 0.82103825]
[-0.25280607 1.0728977 -0.6096982 1.1957564 0.82103825]
[-0.25280607 1.0728977 -0.6096982 1.1957564 0.82103825]
The activation is None. Why is the output of the Keras dense layer this?
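One way to check where the numbers come from is to fetch the layer's actual kernel and bias in the same session run and redo the computation by hand. A minimal sketch, assuming TensorFlow 1.x graph mode and that Dense comes from tf.keras (the question does not show its imports):
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense

batch_size, label_num, units = 10, 2, 5
batch_data = tf.ones((batch_size, label_num))
dense_layer = Dense(units)                      # activation=None by default
output = dense_layer(batch_data)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    kernel, bias, out = sess.run([dense_layer.kernel, dense_layer.bias, output])
    # With no activation, Dense computes inputs @ kernel + bias
    manual = np.ones((batch_size, label_num)).dot(kernel) + bias
    print(np.allclose(out, manual))             # should print True
If the kernel printed separately was taken from a different build of the layer or a different session run, it need not match the kernel that actually produced the output above.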

Related

Creating softmax from a tf.distributions.Categorical output layer

I'm training an agent to act in a discrete environment, and I'm using a tf.distributions.Categorical output layer which I then sample to create a softmax output to determine what action to take. I create my policy network like this:
pi_eval, _ = self._build_anet(self.state, 'pi', reuse=True)
def _build_anet(self, state_in, name, reuse=False):
    w_reg = tf.contrib.layers.l2_regularizer(L2_REG)

    with tf.variable_scope(name, reuse=reuse):
        layer_1 = tf.layers.dense(state_in, HIDDEN_LAYER_NEURONS, tf.nn.relu, kernel_regularizer=w_reg, name="pi_l1")
        layer_2 = tf.layers.dense(layer_1, HIDDEN_LAYER_NEURONS, tf.nn.relu, kernel_regularizer=w_reg, name="pi_l2")
        a_logits = tf.layers.dense(layer_2, self.a_dim, kernel_regularizer=w_reg, name="pi_logits")
        dist = tf.distributions.Categorical(logits=a_logits)

    params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)
    return dist, params
I then sample the network and build up a class distribution output to act as a softmax output, using the example from the tf.distributions.Categorical webpage:
n = 1e4
self.logits_action = tf.cast(tf.histogram_fixed_width(values=pi_eval.sample(int(n)), value_range=[0, 1], nbins=self.a_dim), dtype=tf.float32) / n
Run like this:
softmax = self.sess.run([self.logits_action], {self.state: state[np.newaxis, :]})
But the outputs only ever have two non-zero entries:
[0.44329998 0. 0. 0.5567 ]
[0.92139995 0. 0. 0.0786 ]
[0.95699996 0. 0. 0.043 ]
[0.7051 0. 0. 0.2949]
My hunch is that it has something to do with value_range, about which the documentation says:
value_range: Shape [2] Tensor of same dtype as values. values <= value_range[0] will be mapped to hist[0], values >= value_range[1] will be mapped to hist[-1].
But I'm not sure what value range I should use. Does anyone have any ideas?
Indeed, as I suspected, it was to do with value_range: I should set the upper bound to the action dimension:
value_range=[0, self.a_dim]
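A hedged illustration of why value_range matters here, with made-up integer samples for a_dim = 4 actions (TF 1.x):
import tensorflow as tf

a_dim = 4
samples = tf.constant([0, 1, 2, 3, 3, 2, 1, 0, 2, 1])    # stand-in for Categorical samples in [0, a_dim)
bad = tf.histogram_fixed_width(samples, value_range=[0, 1], nbins=a_dim)
good = tf.histogram_fixed_width(samples, value_range=[0, a_dim], nbins=a_dim)
with tf.Session() as sess:
    print(sess.run(bad))     # [2 0 0 8] -- every sample >= 1 is clipped into the last bin
    print(sess.run(good))    # [2 3 3 2] -- one bin per action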

Tensorflow predict the class of output

I have tried the example with Keras, but not with an LSTM. My model uses an LSTM in TensorFlow, and I want to predict the output in the form of classes, as the Keras model does with predict_classes.
The Tensorflow model I am trying is something like this:
seq_len=10
n_steps = seq_len-1
n_inputs = x_train.shape[2]
n_neurons = 50
n_outputs = y_train.shape[1]
n_layers = 2
learning_rate = 0.0001
batch_size =100
n_epochs = 1000
train_set_size = x_train.shape[0]
test_set_size = x_test.shape[0]
tf.reset_default_graph()
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_outputs])
layers = [tf.contrib.rnn.LSTMCell(num_units=n_neurons,activation=tf.nn.sigmoid, use_peepholes = True) for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
stacked_outputs = tf.layers.dense(stacked_rnn_outputs, n_outputs)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])
outputs = outputs[:,n_steps-1,:]
loss = tf.reduce_mean(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
I am encoding the labels with sklearn's LabelEncoder as:
encoder_train = LabelEncoder()
encoder_train.fit(y_train)
encoded_Y_train = encoder_train.transform(y_train)
y_train = np_utils.to_categorical(encoded_Y_train)
The labels are converted to a sparse, one-hot binary matrix.
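A hedged illustration of what this encoding step produces, with made-up labels matching the ones in the output below ('ball', 'bat') plus a hypothetical third class:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

labels = ['ball', 'bat', 'ball', 'football']
enc = LabelEncoder().fit(labels)                 # classes_ sorted: ['ball', 'bat', 'football']
print(enc.transform(labels))                     # [0 1 0 2]
print(np_utils.to_categorical(enc.transform(labels)))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]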
When I tried to predict the output I got the following:
actual==> [[0. 0. 1.]
[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[1. 0. 0.]
[1. 0. 0.]
[1. 0. 0.]
[0. 1. 0.]
[0. 1. 0.]]
predicted==> [[0.3112209 0.3690182 0.31357136]
[0.31085992 0.36959863 0.31448898]
[0.31073445 0.3703295 0.31469804]
[0.31177694 0.37011752 0.3145326 ]
[0.31220382 0.3692756 0.31515726]
[0.31232828 0.36947766 0.3149037 ]
[0.31190437 0.36756667 0.31323162]
[0.31339088 0.36542615 0.310322 ]
[0.31598282 0.36328828 0.30711085]]
What I expected was the class label based on the encoding that was done, as with the Keras model. See the following:
predictions = model.predict_classes(X_test, verbose=True)
print("REAL VALUES:",reverse_category(Y_test,axis=1))
print("PRED VALUES:",predictions)
print("REAL COLORS:")
print(encoder.inverse_transform(reverse_category(Y_test,axis=1)))
print("PREDICTED COLORS:")
print(encoder.inverse_transform(predictions))
The output is something like the following:
REAL VALUES: [1 1 1 ... 1 2 1]
PRED VALUES: [2 1 1 ... 1 2 2]
REAL COLORS:
['ball' 'ball' 'ball' ... 'ball' 'bat' 'ball']
PREDICTED COLORS:
['bat' 'ball' 'ball' ... 'ball' 'bat' 'bat']
Kindly let me know what I can do in the TensorFlow model to get the result in terms of the encoding that was done.
I am using TensorFlow 1.12.0 on Windows 10.
You are trying to map the predicted class probabilities back to class labels. Each row in the list of output predictions contains the three predicted class probabilities. Use np.argmax to obtain the one with the highest predicted probability in order to map to the predicted class label:
import numpy as np
predictions = [[0.3112209, 0.3690182, 0.31357136],
[0.31085992, 0.36959863, 0.31448898],
[0.31073445, 0.3703295, 0.31469804],
[0.31177694, 0.37011752, 0.3145326 ],
[0.31220382, 0.3692756, 0.31515726],
[0.31232828, 0.36947766, 0.3149037 ],
[0.31190437, 0.36756667, 0.31323162],
[0.31339088, 0.36542615, 0.310322 ],
[0.31598282, 0.36328828, 0.30711085]]
np.argmax(predictions, axis=1)
Gives:
array([1, 1, 1, 1, 1, 1, 1, 1, 1])
In this case, class 1 is predicted 9 times.
As noted in the comments: this is exactly what Keras does under the hood, as you'll see in the source code.
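A hedged sketch of how to go from the TensorFlow model's raw outputs back to the original string labels, mirroring the Keras predict_classes + inverse_transform flow; it assumes a live session sess with the graph defined above, that x_test already has shape [batch, n_steps, n_inputs], and that encoder_train is the fitted LabelEncoder:
import numpy as np

pred_scores = sess.run(outputs, feed_dict={X: x_test})        # shape [batch, n_outputs]
pred_classes = np.argmax(pred_scores, axis=1)                 # integer class ids, like predict_classes
pred_labels = encoder_train.inverse_transform(pred_classes)   # back to e.g. 'ball' / 'bat'
print(pred_labels)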

Keras LSTM layer output and the output of a numpy LSTM implementation are similar but not same with the same weights and Input

I built a two-layer LSTM Keras model, then compared the output of the first LSTM layer with my simple Python implementation of an LSTM layer, feeding in the same weights and inputs. The results for the first sequence of a batch are similar but not the same, and from the second sequence onwards the results deviate too far.
Below is my keras model:
For comparison with the Keras model, I first created an intermediate model that outputs the result of the first layer, using print(intermediate_output[0,0]) for the first sequence, print(intermediate_output[0][1]) for the second sequence of the same batch, and print(intermediate_output[0][127]) for the last sequence.
inputs = Input(shape=(128,9))
f1=LSTM((n_hidden),return_sequences=True,name='lstm1')(inputs)
f2=LSTM((n_hidden), return_sequences=False,name='lstm2')(f1)
fc=Dense(6,activation='softmax',kernel_regularizer=regularizers.l2(lambda_loss_amount),name='fc')(f2)
model2 = Model(inputs=inputs, outputs=fc)
layer_name = 'lstm1'
intermediate_layer_model = Model(inputs=model2.input,
                                 outputs=model2.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict(X_single_sequence[0,:,:])
print(intermediate_output[0,0]) # takes input[0][9]
print(intermediate_output[0][1]) # takes input[1][9] and hidden layer output of intermediate_output[0,0]
print(intermediate_output[0][127])
Re-implemented first layer of the same model:
I defined an LSTMlayer function that does the same computation. After that, weightLSTM loads the saved weights and x_t the same input sequence; later on, h_t contains the outputs for the next sequence. intermediate_out is a function corresponding to the LSTM layer.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def LSTMlayer(warr, uarr, barr, x_t, h_tm1, c_tm1):
    '''
    c_tm1 = np.array([0,0]).reshape(1,2)
    h_tm1 = np.array([0,0]).reshape(1,2)
    x_t = np.array([1]).reshape(1,1)
    warr.shape = (nfeature, hunits*4)
    uarr.shape = (hunits, hunits*4)
    barr.shape = (hunits*4,)
    '''
    s_t = x_t.dot(warr) + h_tm1.dot(uarr) + barr
    hunit = uarr.shape[0]
    i = sigmoid(s_t[:, :hunit])              # input gate (Keras gate order: i, f, c, o)
    f = sigmoid(s_t[:, 1*hunit:2*hunit])     # forget gate
    _c = np.tanh(s_t[:, 2*hunit:3*hunit])    # candidate cell state
    o = sigmoid(s_t[:, 3*hunit:])            # output gate
    c_t = i*_c + f*c_tm1
    h_t = o*np.tanh(c_t)
    return (h_t, c_t)
weightLSTM = model2.layers[1].get_weights()
warr,uarr, barr = weightLSTM
warr.shape,uarr.shape,barr.shape
def intermediate_out(n, warr, uarr, barr, X_test):
    for i in range(0, n+1):
        if i == 0:
            c_tm1 = np.array([0]*hunits, dtype=np.float32).reshape(1,32)
            h_tm1 = np.array([0]*hunits, dtype=np.float32).reshape(1,32)
            h_t, ct = LSTMlayer(warr, uarr, barr, X_test[0][0:1][0:9], h_tm1, c_tm1)
        else:
            h_t, ct = LSTMlayer(warr, uarr, barr, X_test[0][i:i+1][0:9], h_t, ct)
    return h_t
# 1st sequence
ht0 = intermediate_out(0,warr,uarr,barr,X_test)
# 2nd sequence
ht1 = intermediate_out(1,warr,uarr,barr,X_test)
# 128th sequence
ht127 = intermediate_out(127,warr,uarr,barr,X_test)
The outputs of the keras LSTM layer from print(intermediate_output[0,0]) are as follows:
array([-0.05616369, -0.02299516, -0.00801201, 0.03872827, 0.07286803,
-0.0081161 , 0.05235862, -0.02240333, 0.0533984 , -0.08501752,
-0.04866522, 0.00254417, -0.05269946, 0.05809477, -0.08961852,
0.03975506, 0.00334282, -0.02813114, 0.01677909, -0.04411673,
-0.06751891, -0.02771493, -0.03293832, 0.04311397, -0.09430656,
-0.00269871, -0.07775293, -0.11201388, -0.08271968, -0.07464679,
-0.03533605, -0.0112953 ], dtype=float32)
and the outputs of my implementation from print(ht0) are:
array([[-0.05591469, -0.02280132, -0.00797964, 0.03681555, 0.06771626,
-0.00855897, 0.05160453, -0.02309707, 0.05746563, -0.08988875,
-0.05093143, 0.00264367, -0.05087904, 0.06033305, -0.0944235 ,
0.04066657, 0.00344291, -0.02881387, 0.01696692, -0.04101779,
-0.06718517, -0.02798996, -0.0346873 , 0.04402719, -0.10021093,
-0.00276826, -0.08390114, -0.1111543 , -0.08879325, -0.07953986,
-0.03261982, -0.01175724]], dtype=float32)
The outputs from print(intermediate_output[0][1]):
array([-0.13193817, -0.03231169, -0.02096735, 0.07571879, 0.12657365,
0.00067896, 0.09008797, -0.05597101, 0.09581321, -0.1696091 ,
-0.08893952, -0.0352162 , -0.07936387, 0.11100324, -0.19354928,
0.09691346, -0.0057206 , -0.03619875, 0.05680932, -0.08598096,
-0.13047703, -0.06360915, -0.05707538, 0.09686109, -0.18573627,
0.00711019, -0.1934243 , -0.21811798, -0.15629804, -0.17204499,
-0.07108577, 0.01727455], dtype=float32)
print(ht1):
array([[-1.34333193e-01, -3.36792655e-02, -2.06091907e-02,
7.15097040e-02, 1.18231244e-01, 7.98894180e-05,
9.03479978e-02, -5.85013032e-02, 1.06357656e-01,
-1.82848617e-01, -9.50253978e-02, -3.67032290e-02,
-7.70251378e-02, 1.16113290e-01, -2.08772928e-01,
9.89214852e-02, -5.82863577e-03, -3.79538871e-02,
6.01535551e-02, -7.99121782e-02, -1.31876275e-01,
-6.66067824e-02, -6.15542643e-02, 9.91254672e-02,
-2.00229391e-01, 7.51443207e-03, -2.13641390e-01,
-2.18286291e-01, -1.70858681e-01, -1.88928470e-01,
-6.49823472e-02, 1.72227081e-02]], dtype=float32)
print(intermediate_output[0][127]):
array([-0.46212202, 0.280646 , 0.514289 , -0.21109435, 0.53513926,
0.20116206, 0.24579187, 0.10773794, -0.6350403 , -0.0052841 ,
-0.15971565, 0.00309152, 0.04909453, 0.29789132, 0.24909772,
0.12323025, 0.15282209, 0.34281147, -0.2948742 , 0.03674917,
-0.22213924, 0.17646286, -0.12948939, 0.06568322, 0.04172657,
-0.28638166, -0.29086435, -0.6872528 , -0.12620741, 0.63395363,
-0.37212485, -0.6649531 ], dtype=float32)
print(ht127):
array([[-0.47431907, 0.29702517, 0.5428258 , -0.21381126, 0.6053808 ,
0.22849198, 0.25656056, 0.10378123, -0.6960949 , -0.09966939,
-0.20533416, -0.01677105, 0.02512029, 0.37508538, 0.35703233,
0.14703275, 0.24901289, 0.35873395, -0.32249793, 0.04093777,
-0.20691746, 0.20096642, -0.11741923, 0.06169611, 0.01019177,
-0.33316574, -0.08499744, -0.6748463 , -0.06659956, 0.71961826,
-0.4071832 , -0.6804066 ]], dtype=float32)
The outputs from (print(intermediate_output[0,0]), print(ht0)) and (print(intermediate_output[0][1]), print(ht1)) are similar, but the outputs from print(intermediate_output[0][127]) and print(ht127) are not the same, and both implementations run on the same GPU.
I looked at the Keras documentation, and it seems to me that I am not doing anything wrong. Please comment on this and let me know what else I am missing here.
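A hedged way to quantify the "similar but not the same" drift across all 128 steps, assuming intermediate_output from the Keras model above has shape (1, 128, n_hidden):
import numpy as np

manual = np.concatenate([intermediate_out(i, warr, uarr, barr, X_test) for i in range(128)], axis=0)
keras_out = intermediate_output[0]                      # (128, n_hidden)
print(np.max(np.abs(manual - keras_out), axis=1))       # per-step maximum absolute deviation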

Tensorflow - Train only a subset of embedding matrix

I have an embedding matrix e defined as follows
e = tf.get_variable(name="embedding", shape=[n_e, d],
                    initializer=tf.contrib.layers.xavier_initializer(uniform=False))
where n_e refers to the number of entities and d is the number of latent dimensions. For this example, say d=10.
Training:
optimizer = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
The model is saved after training.
At some point later, new entities (e.g., 2) are added, resulting in n_e_new. Now I would like to re-train the model while retaining the embeddings for the already-trained entities, i.e., retraining only the delta (the 2 new entities).
I load the saved e and
init_e = np.zeros((n_e_new, d), dtype=np.float32)
r = list(range(n_e_new - 2))
init_e[r, :] = # load e from saved model
e = tf.get_variable(name="embedding", initializer=init_e)
gather_e = tf.nn.embedding_lookup(e, [n_e, n_e+1])
Training:
optimizer = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = optimizer.compute_gradients(loss, gather_e)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
I get an error at compute_gradients:
NotImplementedError: ('Trying to optimize unsupported type ', )
I understand that the second parameter gather_e to compute_gradients is not a variable but cannot figure out how to achieve this partial training/update.
P.S - I also had a look at this post, but cannot seem to find a solution there either.
EDIT:
Code sample (as per the approach suggested by @meruf):
if new_data_available:
    e = tf.get_variable(name="embedding", shape=[n_e_new, 1, d],
                        initializer=tf.contrib.layers.xavier_initializer(uniform=False))
    e_old = tf.get_variable(name="embedding_old", initializer=<load e from saved model>, trainable=False)
    e_new = tf.concat([e_old, e], 0)
else:
    e = tf.get_variable(name="embedding", shape=[n_e, d],
                        initializer=tf.contrib.layers.xavier_initializer(uniform=False))
Lookup is as follows:
if new_data_available:
    var_p = tf.nn.embedding_lookup(e_new, indices)
else:
    var_p = tf.nn.embedding_lookup(e, indices)

loss = # some operations on var_p and other variables that are a result of the lookup above
The issue is that when new_data_available is true, neither e nor e_new changes during each epoch. They remain the same.
You should not change the code at the optimizer level. You can easily tell TensorFlow whether a variable is trainable or not.
Let's take a look at the tf.get_variable() definition:
tf.get_variable(
    name,
    shape=None,
    dtype=None,
    initializer=None,
    regularizer=None,
    trainable=True,
    collections=None,
    caching_device=None,
    partitioner=None,
    validate_shape=True,
    use_resource=None,
    custom_getter=None,
    constraint=None
)
Here the trainable parameter indicates whether the variable is trainable or not. When you do not want to train a variable, set it to False.
For your case, make two sets of variables: one with trainable=True and the other with trainable=False.
Assume you have 100 pretrained variables and 10 new variables to train. Now load the pretrained variables into A and the new variables into B.
Note:
For implementation details, you should take a look at the tf.cond function for runtime decisions, mostly for the lookup: your new B embeddings have indices starting from 0, but you may have assigned them indices starting from (number of pretrained embeddings + 1) in your dataset or program. So in TensorFlow you can take a runtime decision like the pseudocode below (a sketch in TensorFlow ops follows it):
pseudocode
if index_number >= number_of_pretrained_embeddings:
    index_number = index_number - number_of_pretrained_embeddings
    look up index_number in B
else:
    look up index_number in A
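A hedged sketch of how that pseudocode could be written with TF 1.x ops, using tf.where to make the decision per index (the worked example further down uses tf.cond per batch instead). A holds the pretrained non-trainable embeddings, B the new trainable ones, and n_pretrained is assumed to be known:
import tensorflow as tf

def combined_lookup(indices, A, B, n_pretrained):
    is_new = indices >= n_pretrained                                   # True where the id refers to a new embedding
    new_idx = tf.where(is_new, indices - n_pretrained, tf.zeros_like(indices))
    old_idx = tf.where(is_new, tf.zeros_like(indices), indices)
    # a rank-1 boolean condition selects rows from the two lookup results
    return tf.where(is_new,
                    tf.nn.embedding_lookup(B, new_idx),
                    tf.nn.embedding_lookup(A, old_idx))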
An IPython notebook of the example. (Slightly different from the example given here.)
Update:
Let's take a look at an example of what I meant.
At first, load the library:
import tensorflow as tf
declare the placeholders
y_ = tf.placeholder(tf.float32, [None, 2])
x = tf.placeholder(tf.int32, [None])
z = tf.placeholder(tf.bool, [])  # whether the examples in x contain new data or not
create the network
e = tf.get_variable(name="embedding", shape=[5,10],initializer=tf.contrib.layers.xavier_initializer(uniform=False))
e_old = tf.get_variable(name="embedding1", shape=[5,10],initializer=tf.contrib.layers.xavier_initializer(uniform=False),trainable=False)
out = tf.cond(z,lambda : e, lambda : e_old)
lookup = tf.nn.embedding_lookup(out,x)
W = tf.get_variable(name="weight", shape=[10,2],initializer=tf.contrib.layers.xavier_initializer(uniform=False))
l = tf.nn.relu(tf.matmul(lookup,W))
y = tf.nn.softmax(l)
calculate loss
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
optimize loss
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
load and run the graph
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
print the initialized values
We are printing the values so that we can check later whether they change or not.
e_out_tf,e_out_old_tf = sess.run([e,e_old])
print("New Data ", e_out_tf)
print("Old Data", e_out_old_tf)
('New Data ', array([[-0.38952214, -0.37217963, 0.11370762, -0.13024905, 0.11420489,
-0.09138191, 0.13781562, -0.1624797 , -0.27410012, -0.5404499 ],
[-0.0065698 , 0.04728106, 0.53637034, -0.13864517, -0.36171854,
0.40325132, 0.7172644 , -0.28067762, -0.0258827 , -0.5615116 ],
[-0.17240004, 0.3765518 , 0.4658525 , 0.16545495, -0.37515178,
-0.39557686, -0.50662124, -0.06570222, -0.3605038 , 0.13746035],
[ 0.19647208, -0.16588202, 0.5739292 , 0.43803877, -0.05350745,
0.71350956, 0.39937392, -0.45939735, 0.09050641, -0.18077391],
[-0.05588558, 0.7295865 , 0.42288807, 0.57227516, 0.7268311 ,
-0.1194113 , 0.28589466, 0.09422033, -0.10094754, 0.3942643 ]],
dtype=float32))
('Old Data', array([[ 0.5308224 , -0.14003026, -0.7685277 , 0.06644323, -0.02585996,
-0.1713268 , 0.04987739, 0.01220775, 0.33571896, 0.19891626],
[ 0.3288728 , -0.09298109, 0.14795913, 0.21343362, 0.14123142,
-0.19770677, 0.7366793 , 0.38711038, 0.37526497, 0.440099 ],
[-0.29200613, 0.4852043 , 0.55407804, -0.13675605, -0.2815263 ,
-0.00703347, 0.31396288, -0.7152872 , 0.0844975 , 0.4210107 ],
[ 0.5046112 , 0.3085646 , 0.19497707, -0.5193338 , -0.0429871 ,
-0.5231836 , -0.38976955, -0.2300536 , -0.00906788, -0.1689194 ],
[-0.1231837 , 0.54029703, 0.45702592, -0.07886257, -0.6420077 ,
-0.24090563, -0.02165782, -0.44103763, -0.20914222, 0.40911582]],
dtype=float32))
test case
Now we will test our theory of whether:
1. the non-trainable variable changes or not
2. the trainable variable changes or not
We declared an additional placeholder z to indicate whether our input contains new data or old data.
Here, index 0 contains new data that is trainable if z is True.
feed_dict={x: [0],z:True}
lookup_tf = sess.run([lookup], feed_dict=feed_dict)
Check that the value matches the value printed above.
print(lookup_tf)
[array([[-0.38952214, -0.37217963, 0.11370762, -0.13024905, 0.11420489,
-0.09138191, 0.13781562, -0.1624797 , -0.27410012, -0.5404499 ]],
dtype=float32)]
We send z=True to indicate which embedding to look up. So when you send a batch, make sure that the batch contains only either old data or new data.
feed_dict={x: [0], y_: [[0,1]], z:True}
_, = sess.run([train_step], feed_dict=feed_dict)
lookup_tf = sess.run([lookup], feed_dict=feed_dict)
After training, let's check whether it behaves as expected.
print(lookup_tf)
[array([[-0.559212 , -0.362611 , 0.06011545, -0.02056453, 0.26133284,
-0.24933788, 0.18598196, -0.00602196, -0.12775017, -0.6666256 ]],
dtype=float32)]
See that index 0 contains new data that is trainable, and it has changed from its previous value because of the SGD update.
Let's try the opposite:
feed_dict={x: [0], y_: [[0,1]], z:False}
lookup_tf = sess.run([lookup], feed_dict=feed_dict)
print(lookup_tf)
_, = sess.run([train_step], feed_dict=feed_dict)
lookup_tf = sess.run([lookup], feed_dict=feed_dict)
print(lookup_tf)
[array([[ 0.5308224 , -0.14003026, -0.7685277 , 0.06644323, -0.02585996,
-0.1713268 , 0.04987739, 0.01220775, 0.33571896, 0.19891626]],
dtype=float32)]
[array([[ 0.5308224 , -0.14003026, -0.7685277 , 0.06644323, -0.02585996,
-0.1713268 , 0.04987739, 0.01220775, 0.33571896, 0.19891626]],
dtype=float32)]

Where is Keras LSTM Bias added at inference time?

The Keras LSTM implementation outputs kernel weights, recurrent weights and a single bias vector. I would have expected there to be a bias for both the kernel weights and the recurrent weights so I am trying to make sure that I understand where this bias is being applied. Consider the randomly initialized example:
test_model = Sequential()
test_model.add(LSTM(4, input_dim=5, input_length=10, return_sequences=True))

for e in zip(test_model.layers[0].trainable_weights, test_model.layers[0].get_weights()):
    print('Param %s:\n%s' % (e[0], e[1]))
    print(e[1].shape)
This will print something like the following:
Param <tf.Variable 'lstm_3/kernel:0' shape=(5, 16) dtype=float32_ref>:
[[-0.46578053 -0.31746995 -0.33488223 0.4640277 -0.46431816 -0.0852727
0.43396038 0.12882692 -0.0822868 -0.23696694 0.4661569 0.4719978
0.12041456 -0.20120585 0.45095628 -0.1172519 ]
[ 0.04213512 -0.24420211 -0.33768272 0.11827284 -0.01744157 -0.09241
0.18402642 0.07530934 -0.28586367 -0.05161515 -0.18925312 -0.19212383
0.07093149 -0.14886391 -0.08835816 0.15116036]
[-0.09760407 -0.27473268 -0.29974532 -0.14995047 0.35970795 0.03962368
0.35579181 -0.21503082 -0.46921644 -0.47543833 -0.51497519 -0.08157375
0.4575423 0.35909468 -0.20627108 0.20574462]
[-0.19834137 0.05490702 0.13013887 -0.52255917 0.20565301 0.12259561
-0.33298236 0.2399289 -0.23061508 0.2385658 -0.08770937 -0.35886696
0.28242612 -0.49390298 -0.23676801 0.09713227]
[-0.21802655 -0.32708862 -0.2184104 -0.28524712 0.37784815 0.50567037
0.47393328 -0.05177036 0.41434419 -0.36551589 0.01406455 0.30521619
0.39916915 0.22952956 0.40699703 0.4528749 ]]
(5, 16)
Param <tf.Variable 'lstm_3/recurrent_kernel:0' shape=(4, 16) dtype=float32_ref>:
[[ 0.28626361 -0.21708137 -0.18340513 -0.02943563 -0.16822724 0.38830781
-0.50277489 -0.07898639 -0.30247116 -0.01375726 -0.34504923 -0.01373435
-0.32458451 -0.03497506 -0.01305341 0.28398186]
[-0.35822678 0.13861786 0.42913082 0.11312254 -0.1593778 0.58666271
0.09238213 -0.24134786 0.2196856 -0.01660753 -0.01929135 -0.02324873
-0.2000526 -0.07921806 -0.33966202 -0.08963238]
[-0.06521184 -0.28180376 0.00445012 -0.32302913 -0.02236169 -0.00901215
0.03330055 0.10727262 0.03839845 -0.58494729 0.36934188 -0.31894827
-0.43042961 0.01130622 0.11946538 -0.13160609]
[-0.31211731 -0.24986106 0.16157174 -0.27083701 0.14389414 -0.23260537
-0.28311059 -0.17966864 -0.28650531 -0.06572254 -0.03313115 0.23230191
0.13236329 0.44721091 -0.42978323 -0.09875761]]
(4, 16)
Param <tf.Variable 'lstm_3/bias:0' shape=(16,) dtype=float32_ref>:
[ 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
(16,)
I grasp that the kernel weights are used for the linear transformation of the inputs, so they are of shape [input_dim, 4 * hidden_units], or in this case [5, 16], and the recurrent weights are used for the linear transformation of the hidden state, so they are of shape [hidden_units, 4 * hidden_units]. The bias, on the other hand, is of shape [4 * hidden_units], so it is conceivable that it could be added to the recurrent_weights, but not to the input transformation. This example shows that the bias, as it is output here, can only be added to the recurrent state:
embedding_dim = 5
hidden_units = 4
test_embedding = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
kernel_weights = test_model.layers[0].get_weights()[0]
recurrent_weights = test_model.layers[0].get_weights()[1]
bias = test_model.layers[0].get_weights()[2]
initial_state = np.zeros((hidden_units, 1))
input_transformation = np.dot(np.transpose(kernel_weights), test_embedding[0]) # + bias or + np.transpose(bias) won't work
recurrent_transformation = np.dot(np.transpose(recurrent_weights), initial_state) + bias
print(input_transformation.shape)
print(recurrent_transformation.shape)
Looking at this blog post there are biases added at pretty much every step, so I'm still feeling pretty lost as to where this bias is being applied.
Can anybody help me clarify where the LSTM bias is being added?
The bias is added to the recurrent cell after the matrix multiply. It doesn't matter whether it's added to inputs after the matmul or to the recurrent data after matmul because addition is commutative. See the LSTM equations below:
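For reference, since the equation image is not reproduced here, a standard form of the LSTM equations (in LaTeX), with each gate's single bias added once to the combined pre-activation:
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)
And a hedged numeric check of the commutativity point, with made-up values in the shapes from the example above (input_dim=5, hidden_units=4):
import numpy as np

x, h = np.random.randn(1, 5), np.random.randn(1, 4)
W, U, b = np.random.randn(5, 16), np.random.randn(4, 16), np.random.randn(16)
# adding the bias to the input term or to the recurrent term gives the same pre-activation
print(np.allclose((x.dot(W) + b) + h.dot(U), x.dot(W) + (h.dot(U) + b)))   # True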
