I want to do some linear algebra (e.g. tf.matmul) using the gradient. By default the gradient is returned as a list of tensors, where the tensors may have different shapes. My solution has been to reshape the gradient into a single vector. This works in eager mode, but now I want to compile my code using tf.function. It seems there is no way to write a function which can 'flatten' the gradient in graph mode (tf.function).
grad = [tf.ones((2,10)), tf.ones((3,))] # an example of what a gradient from tape.gradient can look like
# this works for flattening the gradient in eager mode only
def flatten_grad(grad):
    return tf.concat([tf.reshape(grad[i], tf.math.reduce_prod(tf.shape(grad[i]))) for i in range(len(grad))], 0)
I tried converting it like this, but it doesn't work with tf.function either.
@tf.function
def flatten_grad1(grad):
    temp = [None]*len(grad)
    for i in tf.range(len(grad)):
        i = tf.cast(i, tf.int32)
        temp[i] = tf.reshape(grad[i], tf.math.reduce_prod(tf.shape(grad[i])))
    return tf.concat(temp, 0)
I tried TensorArrays, but it also does not work.
@tf.function
def flatten_grad2(grad):
    temp = tf.TensorArray(tf.float32, size=len(grad), infer_shape=False)
    for i in tf.range(len(grad)):
        i = tf.cast(i, tf.int32)
        temp = temp.write(i, tf.reshape(grad[i], tf.math.reduce_prod(tf.shape(grad[i]))))
    return temp.concat()
Maybe you could try directly iterating over your list of tensors instead of getting individual tensors by their index:
import tensorflow as tf
grad = [tf.ones((2,10)), tf.ones((3,))] # an example of what a gradient from tape.gradient can look like
@tf.function
def flatten_grad1(grad):
    temp = [None]*len(grad)
    for i, g in enumerate(grad):
        temp[i] = tf.reshape(g, (tf.math.reduce_prod(tf.shape(g)), ))
    return tf.concat(temp, axis=0)
print(flatten_grad1(grad))
tf.Tensor([1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.], shape=(23,), dtype=float32)
With tf.TensorArray:
@tf.function
def flatten_grad2(grad):
    temp = tf.TensorArray(tf.float32, size=0, dynamic_size=True, infer_shape=False)
    for g in grad:
        temp = temp.write(temp.size(), tf.reshape(g, (tf.math.reduce_prod(tf.shape(g)), )))
    return temp.concat()
print(flatten_grad2(grad))
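Because the gradient is an ordinary Python list, a plain Python loop over it is unrolled while tf.function traces the function, so neither tf.range nor a TensorArray is strictly needed. A minimal sketch of that idea, assuming TensorFlow 2.x; the name flatten_grads is made up for this illustration, and tf.reshape(g, [-1]) is swapped in for computing the element count by hand:

```python
import tensorflow as tf

@tf.function
def flatten_grads(grads):
    # The Python list is iterated at trace time, so every tensor is
    # reshaped to 1-D individually and the pieces are concatenated.
    return tf.concat([tf.reshape(g, [-1]) for g in grads], axis=0)

# Same example shapes as in the question: 2*10 + 3 = 23 elements.
grads = [tf.ones((2, 10)), tf.ones((3,))]
flat = flatten_grads(grads)
print(flat.shape)
```

The resulting vector can then be passed to linear-algebra ops such as tf.matmul after adding an axis, for example with tf.expand_dims.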
Hi, I think the biggest problem is the loops; explicit Python loops are discouraged for this kind of computation.
Here is an example of how to flatten using TensorFlow functions. Your gradient variables look a bit unusual, though: normally they would have a consistent shape with a leading batch dimension.
import tensorflow as tf
import numpy as np
@tf.function
def flatten(arr):
    dim = tf.math.reduce_prod(tf.shape(arr)[1:])
    return tf.reshape(arr, [-1, dim])
grad = tf.Variable(np.random.randn(100, 10, 10, 3))
flatten_grad = flatten(grad)
I am testing a Keras layer. I built a simple dense layer whose input has shape (10, 2) with all values equal to 1, and I use zero_initial_state to initialize the layer weights. However, I cannot explain the output of the dense layer; it seems to compute the final outputs with something unknown. My code is:
batch_size = 10
time_steps = 30
label_num = 2
units = 5
batch_data = tf.ones((batch_size, label_num))
dense_layer = Dense(units)
output = dense_layer(batch_data)
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    print('__________________output_____________________')
    print(sess.run(output))
I print the initial kernel and bias:
____________________self.kernel____________________
[[-0.6072792 0.87520194 -0.5916964 -0.28233814 0.37042332]
[ 0.24503589 -0.8950937 -0.7122175 0.67322683 0.9035703 ]]
____________________self.bias____________________
[0. 0. 0. 0. 0.]
I think the final output should be:
[[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
[-0.3622433 -0.01989174 -1.3039138 0.3908887 1.2739936 ]
....
However, the final output is:
[[-0.25280607 1.0728977 -0.6096982 1.1957564 0.82103825]
[-0.25280607 1.0728977 -0.6096982 1.1957564 0.82103825]
[-0.25280607 1.0728977 -0.6096982 1.1957564 0.82103825]
Activation is None. Why is the output of the Keras dense layer like this?
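The expected values quoted in the question follow from the dense-layer formula output = input · kernel + bias. A small NumPy check, with the kernel and bias copied from the printout above, reproduces them. If the layer's real output differs, the kernel used in that run was presumably not the printed one (an assumption, for example if the variables were re-initialized between printing and running):

```python
import numpy as np

# Kernel and bias values copied from the printout in the question.
kernel = np.array([
    [-0.6072792, 0.87520194, -0.5916964, -0.28233814, 0.37042332],
    [0.24503589, -0.8950937, -0.7122175, 0.67322683, 0.9035703],
], dtype=np.float32)
bias = np.zeros(5, dtype=np.float32)

# Input of shape (10, 2) with all values equal to 1.
batch_data = np.ones((10, 2), dtype=np.float32)

# A dense layer without activation computes inputs @ kernel + bias,
# so with all-ones input every row is the column-wise sum of the kernel.
expected = batch_data @ kernel + bias
print(expected[0])
```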
I built a two-layer LSTM model in Keras, then compared the output of the first LSTM layer with my simple Python implementation of an LSTM layer, feeding in the same weights and inputs. The results for the first sequence of a batch are similar but not identical, and from the second sequence onward the results deviate too far.
Below is my Keras model:
To compare against the Keras model, I first created an intermediate model that outputs the result of the first layer, with print(intermediate_output[0,0]) for the first sequence, print(intermediate_output[0][1]) for the second sequence of the same batch, and print(intermediate_output[0][127]) for the last sequence.
inputs = Input(shape=(128,9))
f1=LSTM((n_hidden),return_sequences=True,name='lstm1')(inputs)
f2=LSTM((n_hidden), return_sequences=False,name='lstm2')(f1)
fc=Dense(6,activation='softmax',kernel_regularizer=regularizers.l2(lambda_loss_amount),name='fc')(f2)
model2 = Model(inputs=inputs, outputs=fc)
layer_name = 'lstm1'
intermediate_layer_model = Model(inputs=model2.input,
outputs=model2.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict(X_single_sequence[0,:,:])
print(intermediate_output[0,0]) # takes input[0][9]
print(intermediate_output[0][1]) # takes input[1][9] and hidden layer output of intermediate_output[0,0]
print(intermediate_output[0][127])
Re-Implemented first layer of the same model:
I defined an LSTMlayer function that does the same computation. After that, weightLSTM loads the saved weights, x_t is the same input sequence, and later on h_t contains the outputs for the next sequence. intermediate_out is a function corresponding to the LSTM layer.
def sigmoid(x):
    return(1.0/(1.0+np.exp(-x)))
def LSTMlayer(warr,uarr, barr,x_t,h_tm1,c_tm1):
    '''
    c_tm1 = np.array([0,0]).reshape(1,2)
    h_tm1 = np.array([0,0]).reshape(1,2)
    x_t = np.array([1]).reshape(1,1)
    warr.shape = (nfeature,hunits*4)
    uarr.shape = (hunits,hunits*4)
    barr.shape = (hunits*4,)
    '''
    s_t = (x_t.dot(warr) + h_tm1.dot(uarr) + barr)
    hunit = uarr.shape[0]
    i = sigmoid(s_t[:,:hunit])
    f = sigmoid(s_t[:,1*hunit:2*hunit])
    _c = np.tanh(s_t[:,2*hunit:3*hunit])
    o = sigmoid(s_t[:,3*hunit:])
    c_t = i*_c + f*c_tm1
    h_t = o*np.tanh(c_t)
    return(h_t,c_t)
weightLSTM = model2.layers[1].get_weights()
warr,uarr, barr = weightLSTM
warr.shape,uarr.shape,barr.shape
def intermediate_out(n,warr,uarr,barr,X_test):
    for i in range(0, n+1):
        if i==0:
            c_tm1 = np.array([0]*hunits, dtype=np.float32).reshape(1,32)
            h_tm1 = np.array([0]*hunits, dtype=np.float32).reshape(1,32)
            h_t,ct = LSTMlayer(warr,uarr, barr,X_test[0][0:1][0:9],h_tm1,c_tm1)
        else:
            h_t,ct = LSTMlayer(warr,uarr, barr,X_test[0][i:i+1][0:9],h_t,ct)
    return h_t
# 1st sequence
ht0 = intermediate_out(0,warr,uarr,barr,X_test)
# 2nd sequence
ht1 = intermediate_out(1,warr,uarr,barr,X_test)
# 128th sequence
ht127 = intermediate_out(127,warr,uarr,barr,X_test)
The outputs of the keras LSTM layer from print(intermediate_output[0,0]) are as follows:
array([-0.05616369, -0.02299516, -0.00801201, 0.03872827, 0.07286803,
-0.0081161 , 0.05235862, -0.02240333, 0.0533984 , -0.08501752,
-0.04866522, 0.00254417, -0.05269946, 0.05809477, -0.08961852,
0.03975506, 0.00334282, -0.02813114, 0.01677909, -0.04411673,
-0.06751891, -0.02771493, -0.03293832, 0.04311397, -0.09430656,
-0.00269871, -0.07775293, -0.11201388, -0.08271968, -0.07464679,
-0.03533605, -0.0112953 ], dtype=float32)
and the outputs of my implementation from print(ht0) are:
array([[-0.05591469, -0.02280132, -0.00797964, 0.03681555, 0.06771626,
-0.00855897, 0.05160453, -0.02309707, 0.05746563, -0.08988875,
-0.05093143, 0.00264367, -0.05087904, 0.06033305, -0.0944235 ,
0.04066657, 0.00344291, -0.02881387, 0.01696692, -0.04101779,
-0.06718517, -0.02798996, -0.0346873 , 0.04402719, -0.10021093,
-0.00276826, -0.08390114, -0.1111543 , -0.08879325, -0.07953986,
-0.03261982, -0.01175724]], dtype=float32)
The outputs from print(intermediate_output[0][1]):
array([-0.13193817, -0.03231169, -0.02096735, 0.07571879, 0.12657365,
0.00067896, 0.09008797, -0.05597101, 0.09581321, -0.1696091 ,
-0.08893952, -0.0352162 , -0.07936387, 0.11100324, -0.19354928,
0.09691346, -0.0057206 , -0.03619875, 0.05680932, -0.08598096,
-0.13047703, -0.06360915, -0.05707538, 0.09686109, -0.18573627,
0.00711019, -0.1934243 , -0.21811798, -0.15629804, -0.17204499,
-0.07108577, 0.01727455], dtype=float32)
print(ht1):
array([[-1.34333193e-01, -3.36792655e-02, -2.06091907e-02,
7.15097040e-02, 1.18231244e-01, 7.98894180e-05,
9.03479978e-02, -5.85013032e-02, 1.06357656e-01,
-1.82848617e-01, -9.50253978e-02, -3.67032290e-02,
-7.70251378e-02, 1.16113290e-01, -2.08772928e-01,
9.89214852e-02, -5.82863577e-03, -3.79538871e-02,
6.01535551e-02, -7.99121782e-02, -1.31876275e-01,
-6.66067824e-02, -6.15542643e-02, 9.91254672e-02,
-2.00229391e-01, 7.51443207e-03, -2.13641390e-01,
-2.18286291e-01, -1.70858681e-01, -1.88928470e-01,
-6.49823472e-02, 1.72227081e-02]], dtype=float32)
print(intermediate_output[0][127]):
array([-0.46212202, 0.280646 , 0.514289 , -0.21109435, 0.53513926,
0.20116206, 0.24579187, 0.10773794, -0.6350403 , -0.0052841 ,
-0.15971565, 0.00309152, 0.04909453, 0.29789132, 0.24909772,
0.12323025, 0.15282209, 0.34281147, -0.2948742 , 0.03674917,
-0.22213924, 0.17646286, -0.12948939, 0.06568322, 0.04172657,
-0.28638166, -0.29086435, -0.6872528 , -0.12620741, 0.63395363,
-0.37212485, -0.6649531 ], dtype=float32)
print(ht127):
array([[-0.47431907, 0.29702517, 0.5428258 , -0.21381126, 0.6053808 ,
0.22849198, 0.25656056, 0.10378123, -0.6960949 , -0.09966939,
-0.20533416, -0.01677105, 0.02512029, 0.37508538, 0.35703233,
0.14703275, 0.24901289, 0.35873395, -0.32249793, 0.04093777,
-0.20691746, 0.20096642, -0.11741923, 0.06169611, 0.01019177,
-0.33316574, -0.08499744, -0.6748463 , -0.06659956, 0.71961826,
-0.4071832 , -0.6804066 ]], dtype=float32)
The outputs from (print(intermediate_output[0,0]), print(ht0)) and (print(intermediate_output[0][1]), print(ht1)) are similar, but the outputs from print(intermediate_output[0][127]) and print(ht127) are not the same, even though both algorithms run on the same GPU.
I read the Keras documentation and it seems to me that I am not doing anything wrong. Please comment on this and let me know what else I am missing here.
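One thing worth checking, as an assumption about the Keras version in use: in Keras 2.x the LSTM layer's default recurrent_activation was hard_sigmoid rather than the logistic sigmoid used in the NumPy code above. That would produce small differences at the first step that grow along the sequence. A minimal sketch of that piecewise-linear function for comparison (hard_sigmoid here is a hand-written helper, not taken from the question; confirm the layer's actual setting with its get_config() before relying on this):

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid, as used in the NumPy re-implementation.
    return 1.0 / (1.0 + np.exp(-x))

def hard_sigmoid(x):
    # Piecewise-linear approximation: 0.2 * x + 0.5, clipped to [0, 1].
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(sigmoid(x))
print(hard_sigmoid(x))
```

If the layer does use hard_sigmoid, replacing sigmoid with hard_sigmoid for the i, f and o gates in LSTMlayer would be the corresponding change to try.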
I have an embedding matrix e defined as follows
e = tf.get_variable(name="embedding", shape=[n_e, d],
initializer=tf.contrib.layers.xavier_initializer(uniform=False))
where n_e refers to the number of entities and d is the number of latent dimensions. For this example, say d=10.
Training:
optimizer = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
The model is saved after training.
At some point later, new entities (e.g., 2) are added, resulting in n_e_new. Now I would like to retrain the model while retaining the embeddings for the already trained entities, i.e., retraining only the delta (the 2 new entities).
I load the saved e and
init_e = np.zeros((n_e_new, d), dtype=np.float32)
r = list(range(n_e_new - 2))
init_e[r, :] = # load e from saved model
e = tf.get_variable(name="embedding", initializer=init_e)
gather_e = tf.nn.embedding_lookup(e, [n_e, n_e+1])
Training:
optimizer = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = optimizer.compute_gradients(loss, gather_e)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
I get an error at compute_gradients:
NotImplementedError: ('Trying to optimize unsupported type ', )
I understand that the second parameter gather_e to compute_gradients is not a variable but cannot figure out how to achieve this partial training/update.
P.S - I also had a look at this post, but cannot seem to find a solution there either.
EDIT:
Code sample (as per the approach suggested by @meruf):
if new_data_available:
    e = tf.get_variable(name="embedding", shape=[n_e_new, 1, d],
                        initializer=tf.contrib.layers.xavier_initializer(uniform=False))
    e_old = tf.get_variable(name="embedding_old", initializer=<load e from saved model>, trainable=False)
    e_new = tf.concat([e_old, e], 0)
else:
    e = tf.get_variable(name="embedding", shape=[n_e, d],
                        initializer=tf.contrib.layers.xavier_initializer(uniform=False))
Lookup is as follows:
if new_data_available:
    var_p = tf.nn.embedding_lookup(e_new, indices)
else:
    var_p = tf.nn.embedding_lookup(e, indices)
loss = #some operations on var_p and other variables that are a result of the lookup above
The issue is that when new_data_available is true, neither e nor e_new changes during each epoch. They remain the same.
You should not change the code at the optimizer level. You can easily tell TensorFlow which variables are trainable and which are not.
Let's take a look at the tf.get_variable() definition:
tf.get_variable(
    name,
    shape=None,
    dtype=None,
    initializer=None,
    regularizer=None,
    trainable=True,
    collections=None,
    caching_device=None,
    partitioner=None,
    validate_shape=True,
    use_resource=None,
    custom_getter=None,
    constraint=None
)
Here the trainable parameter indicates whether the variable is trainable. When you do not want to train a variable, set it to False.
For your case, make two sets of variables: one with trainable=True and the other with trainable=False.
Assume you have 100 pretrained variables and 10 new variables to train. Load the pretrained variables into A and the new variables into B.
Note:
For implementation details, take a look at the tf.cond function for runtime decisions, mostly for the lookup. This is needed because your new B embeddings now have indices starting from 0, whereas you may have numbered them starting from (number of pretrained embeddings + 1) in your dataset or program. So in TensorFlow you can make the following runtime decision:
pseudocode
if index_number is >= number of pretrained embedding
    index_number = index_number - number of pretrained embedding
    look_up on B matrix
else
    look_up on A matrix
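The pseudocode above can be sketched in NumPy as a stand-in for the TensorFlow ops. This is only an illustration under stated assumptions: split_lookup is a hypothetical helper, indices are 0-based so that new rows start at the number of pretrained rows, and the choice is made per index with np.where rather than per batch with tf.cond:

```python
import numpy as np

def split_lookup(indices, A, B):
    # A holds the pretrained (frozen) rows, B the new (trainable) rows.
    indices = np.asarray(indices)
    n_pretrained = A.shape[0]
    is_new = indices >= n_pretrained
    # Clamp the indices so both gathers stay in range; np.where below
    # then keeps only the row from the matrix each index belongs to.
    rows_a = A[np.where(is_new, 0, indices)]
    rows_b = B[np.where(is_new, indices - n_pretrained, 0)]
    return np.where(is_new[:, None], rows_b, rows_a)

A = np.arange(12, dtype=np.float32).reshape(4, 3)  # 4 pretrained rows
B = -np.ones((2, 3), dtype=np.float32)             # 2 new rows
out = split_lookup([1, 4, 5, 0], A, B)
print(out)
```

Indices 1 and 0 are served from A, while indices 4 and 5 are shifted by the number of pretrained rows and served from B.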
An Ipython Notebook of the example. (slightly different than the example given here.)
update:
Let's take a look at an example of what I meant.
First, load the library
import tensorflow as tf
declare the placeholders
y_ = tf.placeholder(tf.float32, [None, 2])
x = tf.placeholder(tf.int32, [None])
z = tf.placeholder(tf.bool, []) # whether the example in x contains new data or not
create the network
e = tf.get_variable(name="embedding", shape=[5,10],initializer=tf.contrib.layers.xavier_initializer(uniform=False))
e_old = tf.get_variable(name="embedding1", shape=[5,10],initializer=tf.contrib.layers.xavier_initializer(uniform=False),trainable=False)
out = tf.cond(z,lambda : e, lambda : e_old)
lookup = tf.nn.embedding_lookup(out,x)
W = tf.get_variable(name="weight", shape=[10,2],initializer=tf.contrib.layers.xavier_initializer(uniform=False))
l = tf.nn.relu(tf.matmul(lookup,W))
y = tf.nn.softmax(l)
calculate loss
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
optimize loss
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
load and run the graph
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
print the initialized value
We are printing the values so that we can check later if our value changes or not.
e_out_tf,e_out_old_tf = sess.run([e,e_old])
print("New Data ", e_out_tf)
print("Old Data", e_out_old_tf)
('New Data ', array([[-0.38952214, -0.37217963, 0.11370762, -0.13024905, 0.11420489,
-0.09138191, 0.13781562, -0.1624797 , -0.27410012, -0.5404499 ],
[-0.0065698 , 0.04728106, 0.53637034, -0.13864517, -0.36171854,
0.40325132, 0.7172644 , -0.28067762, -0.0258827 , -0.5615116 ],
[-0.17240004, 0.3765518 , 0.4658525 , 0.16545495, -0.37515178,
-0.39557686, -0.50662124, -0.06570222, -0.3605038 , 0.13746035],
[ 0.19647208, -0.16588202, 0.5739292 , 0.43803877, -0.05350745,
0.71350956, 0.39937392, -0.45939735, 0.09050641, -0.18077391],
[-0.05588558, 0.7295865 , 0.42288807, 0.57227516, 0.7268311 ,
-0.1194113 , 0.28589466, 0.09422033, -0.10094754, 0.3942643 ]],
dtype=float32))
('Old Data', array([[ 0.5308224 , -0.14003026, -0.7685277 , 0.06644323, -0.02585996,
-0.1713268 , 0.04987739, 0.01220775, 0.33571896, 0.19891626],
[ 0.3288728 , -0.09298109, 0.14795913, 0.21343362, 0.14123142,
-0.19770677, 0.7366793 , 0.38711038, 0.37526497, 0.440099 ],
[-0.29200613, 0.4852043 , 0.55407804, -0.13675605, -0.2815263 ,
-0.00703347, 0.31396288, -0.7152872 , 0.0844975 , 0.4210107 ],
[ 0.5046112 , 0.3085646 , 0.19497707, -0.5193338 , -0.0429871 ,
-0.5231836 , -0.38976955, -0.2300536 , -0.00906788, -0.1689194 ],
[-0.1231837 , 0.54029703, 0.45702592, -0.07886257, -0.6420077 ,
-0.24090563, -0.02165782, -0.44103763, -0.20914222, 0.40911582]],
dtype=float32))
test case
Now we will test our theory:
1. whether the non-trainable variable changes
2. whether the trainable variable changes
We declared an additional placeholder z to indicate whether our input contains new data or old data.
Here, index 0 contains new data that is trainable if z is True.
feed_dict={x: [0],z:True}
lookup_tf = sess.run([lookup], feed_dict=feed_dict)
check that the value matches with above value.
print(lookup_tf)
[array([[-0.38952214, -0.37217963, 0.11370762, -0.13024905, 0.11420489,
-0.09138191, 0.13781562, -0.1624797 , -0.27410012, -0.5404499 ]],
dtype=float32)]
we will send z=True to indicate on which embedding you want to lookup.
So while you send a batch make sure that the batch contains only either old data or new data.
feed_dict={x: [0], y_: [[0,1]], z:True}
_, = sess.run([train_step], feed_dict=feed_dict)
lookup_tf = sess.run([lookup], feed_dict=feed_dict)
After training, let's check whether it behaves as expected.
print(lookup_tf)
[array([[-0.559212 , -0.362611 , 0.06011545, -0.02056453, 0.26133284,
-0.24933788, 0.18598196, -0.00602196, -0.12775017, -0.6666256 ]],
dtype=float32)]
See, index 0 contains new data that is trainable, and it changes from its previous value because of the SGD update.
let's try the opposite
feed_dict={x: [0], y_: [[0,1]], z:False}
lookup_tf = sess.run([lookup], feed_dict=feed_dict)
print(lookup_tf)
_, = sess.run([train_step], feed_dict=feed_dict)
lookup_tf = sess.run([lookup], feed_dict=feed_dict)
print(lookup_tf)
[array([[ 0.5308224 , -0.14003026, -0.7685277 , 0.06644323, -0.02585996,
-0.1713268 , 0.04987739, 0.01220775, 0.33571896, 0.19891626]],
dtype=float32)]
[array([[ 0.5308224 , -0.14003026, -0.7685277 , 0.06644323, -0.02585996,
-0.1713268 , 0.04987739, 0.01220775, 0.33571896, 0.19891626]],
dtype=float32)]
The Keras LSTM implementation outputs kernel weights, recurrent weights and a single bias vector. I would have expected there to be a bias for both the kernel weights and the recurrent weights so I am trying to make sure that I understand where this bias is being applied. Consider the randomly initialized example:
test_model = Sequential()
test_model.add(LSTM(4,input_dim=5,input_length=10,return_sequences=True))
for e in zip(test_model.layers[0].trainable_weights, test_model.layers[0].get_weights()):
    print('Param %s:\n%s' % (e[0],e[1]))
    print(e[1].shape)
This will print something like the following:
Param <tf.Variable 'lstm_3/kernel:0' shape=(5, 16) dtype=float32_ref>:
[[-0.46578053 -0.31746995 -0.33488223 0.4640277 -0.46431816 -0.0852727
0.43396038 0.12882692 -0.0822868 -0.23696694 0.4661569 0.4719978
0.12041456 -0.20120585 0.45095628 -0.1172519 ]
[ 0.04213512 -0.24420211 -0.33768272 0.11827284 -0.01744157 -0.09241
0.18402642 0.07530934 -0.28586367 -0.05161515 -0.18925312 -0.19212383
0.07093149 -0.14886391 -0.08835816 0.15116036]
[-0.09760407 -0.27473268 -0.29974532 -0.14995047 0.35970795 0.03962368
0.35579181 -0.21503082 -0.46921644 -0.47543833 -0.51497519 -0.08157375
0.4575423 0.35909468 -0.20627108 0.20574462]
[-0.19834137 0.05490702 0.13013887 -0.52255917 0.20565301 0.12259561
-0.33298236 0.2399289 -0.23061508 0.2385658 -0.08770937 -0.35886696
0.28242612 -0.49390298 -0.23676801 0.09713227]
[-0.21802655 -0.32708862 -0.2184104 -0.28524712 0.37784815 0.50567037
0.47393328 -0.05177036 0.41434419 -0.36551589 0.01406455 0.30521619
0.39916915 0.22952956 0.40699703 0.4528749 ]]
(5, 16)
Param <tf.Variable 'lstm_3/recurrent_kernel:0' shape=(4, 16) dtype=float32_ref>:
[[ 0.28626361 -0.21708137 -0.18340513 -0.02943563 -0.16822724 0.38830781
-0.50277489 -0.07898639 -0.30247116 -0.01375726 -0.34504923 -0.01373435
-0.32458451 -0.03497506 -0.01305341 0.28398186]
[-0.35822678 0.13861786 0.42913082 0.11312254 -0.1593778 0.58666271
0.09238213 -0.24134786 0.2196856 -0.01660753 -0.01929135 -0.02324873
-0.2000526 -0.07921806 -0.33966202 -0.08963238]
[-0.06521184 -0.28180376 0.00445012 -0.32302913 -0.02236169 -0.00901215
0.03330055 0.10727262 0.03839845 -0.58494729 0.36934188 -0.31894827
-0.43042961 0.01130622 0.11946538 -0.13160609]
[-0.31211731 -0.24986106 0.16157174 -0.27083701 0.14389414 -0.23260537
-0.28311059 -0.17966864 -0.28650531 -0.06572254 -0.03313115 0.23230191
0.13236329 0.44721091 -0.42978323 -0.09875761]]
(4, 16)
Param <tf.Variable 'lstm_3/bias:0' shape=(16,) dtype=float32_ref>:
[ 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
(16,)
I grasp that the kernel weights are used for the linear transformation of the inputs, so they are of shape [input_dim, 4 * hidden_units], or in this case [5, 16], and that the recurrent weights are used for the linear transformation of the recurrent state, so they are of shape [hidden_units, 4 * hidden_units]. The bias, on the other hand, is of shape [4 * hidden_units], so it is conceivable that it could be added to the recurrent transformation but not to the input transformation. This example shows that the bias, as it is output here, can only be added to the recurrent state:
embedding_dim = 5
hidden_units = 4
test_embedding = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
kernel_weights = test_model.layers[0].get_weights()[0]
recurrent_weights = test_model.layers[0].get_weights()[1]
bias = test_model.layers[0].get_weights()[2]
initial_state = np.zeros((hidden_units, 1))
input_transformation = np.dot(np.transpose(kernel_weights), test_embedding[0]) # + bias or + np.transpose(bias) won't work
recurrent_transformation = np.dot(np.transpose(recurrent_weights), initial_state) + bias
print(input_transformation.shape)
print(recurrent_transformation.shape)
Looking at this blog post there are biases added at pretty much every step, so I'm still feeling pretty lost as to where this bias is being applied.
Can anybody help me clarify where the LSTM bias is being added?
The bias is added to the recurrent cell after the matrix multiply. It doesn't matter whether it's added to inputs after the matmul or to the recurrent data after matmul because addition is commutative. See the LSTM equations below:
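For reference, a standard LSTM formulation with a single bias per gate can be written as follows. This is the usual textbook form, assumed here to match the Keras layer; W, U and b denote the kernel, recurrent kernel and bias, each split into four blocks of width hidden_units in the order input gate, forget gate, cell candidate, output gate, and x_t and h_{t-1} are treated as row vectors:

```latex
\begin{aligned}
i_t &= \sigma\left(x_t W_i + h_{t-1} U_i + b_i\right) \\
f_t &= \sigma\left(x_t W_f + h_{t-1} U_f + b_f\right) \\
\tilde{c}_t &= \tanh\left(x_t W_c + h_{t-1} U_c + b_c\right) \\
o_t &= \sigma\left(x_t W_o + h_{t-1} U_o + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

With row vectors, x_t W has shape [1, 4 * hidden_units] and so does h_{t-1} U, so the single bias of shape [4 * hidden_units] can be added to their sum. The forget-gate block b_f corresponds to the four ones in the printed bias.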