Using Dropout on output of embedding layer changes array values, Why?

Using Dropout on output of embedding layer changes array values, Why? - python

Observing the outputs of embedding layer with and without dropout shows that values in the arrays are replaced with 0. But along with this why other values of array changed ?
Following is my model:-
input = Input(shape=(23,))
model = Embedding(input_dim=n_words, output_dim=23, input_length=23)(input)
model = Dropout(0.2)(model)
model = Bidirectional(LSTM(units=LSTM_N, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model) # softmax output layer
model = Model(input, out)
Building model2 from trained model , with input as the input layer and output as the output of Dropout(0.2) . -
from keras import backend as K
model2 = K.function([model.layers[0].input , K.learning_phase()],
[model.layers[2].output] )
dropout = model2([X_train[0:1] , 1])[0]
nodrop = model2([X_train[0:1] , 0])[0]
Printing the first array of both dropout and no dropout:
dropout[0][0]
Output-
array([ 0. , -0. , -0. , -0.04656423, -0. ,
0.28391626, 0.12213208, -0.01187495, -0.02078421, -0. ,
0.10585815, -0. , 0.27178472, -0.21080771, 0. ,
-0.09336889, 0.07441022, 0.02960865, -0.2755439 , -0.11252255,
-0.04330419, -0. , 0.04974075], dtype=float32)
-
nodrop[0][0]
Output-
array([ 0.09657606, -0.06267098, -0.00049554, -0.03725138, -0.11286845,
0.22713302, 0.09770566, -0.00949996, -0.01662737, -0.05788678,
0.08468652, -0.22405024, 0.21742778, -0.16864617, 0.08558936,
-0.07469511, 0.05952817, 0.02368692, -0.22043513, -0.09001804,
-0.03464335, -0.05152775, 0.0397926 ], dtype=float32)
Some values are replaced with 0 , agreed, but why are other values changed ?
As the embedding outputs have a meaning and are unique for each of the words, if these are changed by applying dropout, then is it correct to apply dropout after embedding layer ?
Note- I have used "learning_phase" as 0 and 1 for testing(nodropout)
and training(droput) respectively.

It is how the dropout regularization works. After applying the dropout, the values are divided by the keeping probability (in this case 0.8).
When you use dropout, the function receives the probability of turning a neuron to zero as input, e.g., 0.2, which means it has 0.8 chance of keeping any given neuron. So, the values remaining will be multiplied by 1/(1-0.2).
This is called "inverted dropout technique" and it is done in order to ensure that the expected value of the activation remains the same. Otherwise, predictions will be wrong during inference when dropout is not used.
You'll notice that your dropout is 0.2, and all your values have been multiplied by 0.8 after you applied dropout.
Look what happens if I divide your second output bu the first:
import numpy as np
a = np.array([ 0. , -0. , -0. , -0.04656423, -0. ,
0.28391626, 0.12213208, -0.01187495, -0.02078421, -0. ,
0.10585815, -0. , 0.27178472, -0.21080771, 0. ,
-0.09336889, 0.07441022, 0.02960865, -0.2755439 , -0.11252255,
-0.04330419, -0. , 0.04974075])
b = np.array([ 0.09657606, -0.06267098, -0.00049554, -0.03725138, -0.11286845,
0.22713302, 0.09770566, -0.00949996, -0.01662737, -0.05788678,
0.08468652, -0.22405024, 0.21742778, -0.16864617, 0.08558936,
-0.07469511, 0.05952817, 0.02368692, -0.22043513, -0.09001804,
-0.03464335, -0.05152775, 0.0397926 ])
print(b/a)
[ inf inf inf 0.79999991 inf 0.80000004
0.79999997 0.8 0.8000001 inf 0.8 inf
0.80000001 0.80000001 inf 0.79999998 0.79999992 0.8
0.80000004 0.8 0.79999995 inf 0.8 ]

Related

predict() values are zero or NaN

I have a GRU model
GRU = keras.models.Sequential([keras.layers.GRU(32),
keras.layers.Dense(32, activation= 'relu'),
keras.layers.Dense(1, activation=None)])
GRU.compile(loss="mae", optimizer="adam")
resultsGRU = GRU.fit_generator(generator = train, validation_data = train, epochs = 3, verbose= 1, shuffle = False)
If I convert train data to numpy array, I can see I don't have any zero or Nan values (I also dropna values before)
trainArray= np.array(train)
print(trainArray)
I copied only a part of array, just so you can see the values:
[[array([[[-0.86286026, 0.51805955, 1.0427724 , ..., 0.27464896,
0.08823532, -1.1183959 ],
[-0.3186916 , 0.00295895, 0.740636 , ..., 0.27464896,
0.08823532, -1.1304985 ],
[-0.31057638, 0.00295895, 0.5593542 , ..., 0.27464896,
-0.5521559 , -1.1183959 ],
...,
If I print resultsGRU
print(resultsGRU.history.values())
I get
dict_values([[0.597104012966156, 0.5652544498443604, 0.5574262142181396], [0.6241905093193054, 0.6183988451957703, 0.6134349703788757]])
Then I use predict, but values are returned 0
predictGRU = GRU.predict(test)
print(predictGRU)
[0. 0. 0. ... 0. 0. 0.]
I then save this model and use it for API and the values are NaN.
What is the problem here? How do I get the model to predict a different, reasonable value?
I also use metrics later on
print(metrics.mean_absolute_error(test, predictGRU))
print(metrics.mean_squared_error(test, predictGRU))
print(metrics.explained_variance_score(test, predictGRU))
And I get normal numbers
0.6471065
0.50334525
0.23076766729354858
I don't know how to fix this on my own.

MinMax inverse_transform after LSTM + Dense(1) layers

My initial data is:
[[ 8375.5 0. 8374.14285714 8374.14285714]
[ 8354.5 0. 8383.39285714 8371.52380952]
...
[11060. 0. 11055.21428571 11032.53702732]
[11076.5 0. 11061.60714286 11038.39875701]]
I create MinMax scaler to transform data to values from 0 to 1
scaler = MinMaxScaler(feature_range = (0, 1))
T = scaler.fit_transform(T)
So data now is:
[[0.5186697 , 0. , 0.46812344, 0.46950912],
[0.5161844 , 0. , 0.46935928, 0.46915412],
...,
[0.72264636, 0. , 0.6767292 , 0.6807525 ],
[0.7198651 , 0. , 0.6785377 , 0.6833385 ]]
I do some magic to prepare this data for LSTM layer and this is the result:
X_train variable of shape (6989, 4, 200)
[[[0.5186697 0. 0.46812344 ... 0. 0.45496237 0.45219505]
[0.48742527 0. 0.45273864 ... 0. 0.43144143 0.431924 ]
[0.4800284 0. 0.43054438 ... 0. 0.425362 0.4326681 ]
[0.5007989 0. 0.4290794 ... 0. 0.4696839 0.47831726]]
...
[[0.61240304 0. 0.57254803 ... 0. 0.5749577 0.57792616]
[0.61139715 0. 0.5746571 ... 0. 0.5971378 0.6017289 ]
[0.6365465 0. 0.59772 ... 0. 0.62671924 0.63145673]
[0.65719867 0. 0.62684333 ... 0. 0.6757128 0.6772785 ]]]
I process the data using this model with Dense(1) layer at the end:
model = Sequential()
model.add(LSTM(units = 50, activation = 'relu', #return_sequences = True,
input_shape = (X_train.shape[1], window_size)))
model.add(Dropout(0.2))
model.add(Dense(1, activation = 'linear'))
model.compile(loss = 'mean_squared_error', optimizer = 'adam')
And when I set return_sequences to false the shape of new data after fit is (6989, 1) and when I want to inverse_transform scaler.inverse_transform(train_predict) using this scalar I get an error:
ValueError: non-broadcastable output operand with shape (6989,1) doesn't match the broadcast shape (6989,4)
When I do set return_sequences to true the new shape is (6989, 4, 1) and when I inverse_transform I get other error:
ValueError: Found array with dim 3. None expected <= 2.
============
I think I know why I get these errors, because scaler requires shape of (6989,4), but what and how can I do to transform this data so that I will be able to inverse_transform?
How can I inverse_transform the data of new shape of (6989, 1)?
How can I inverse_transform the data of new shape of (6989, 4, 1)?
Is it doable? My scaler can be used? Or should I create new scaler? Can you suggest something? What am I missing?
I will appreciate any help, thanks!

Many-to-many lstm model on varying samples

I have recently started learning how to build LSTM model for multivariate time series data. I have looked here and here on how to pad sequences and implement many-to-many LSTM model. I have created a dataframe to test the model but I keep getting an error (below).
d = {'ID':['a12', 'a12','a12','a12','a12','b33','b33','b33','b33','v55','v55','v55','v55','v55','v55'], 'Exp_A':[2.2,2.2,2.2,2.2,2.2,3.1,3.1,3.1,3.1,1.5,1.5,1.5,1.5,1.5,1.5],
'Exp_B':[2.4,2.4,2.4,2.4,2.4,1.2,1.2,1.2,1.2,1.5,1.5,1.5,1.5,1.5,1.5],
'A':[0,0,1,0,1,0,1,0,1,0,1,1,1,0,1], 'B':[0,0,1,1,1,0,0,1,1,1,0,0,1,0,1],
'Time_Interval': ['11:00:00', '11:10:00', '11:20:00', '11:30:00', '11:40:00',
'11:00:00', '11:10:00', '11:20:00', '11:30:00',
'11:00:00', '11:10:00', '11:20:00', '11:30:00', '11:40:00', '11:50:00']}
df = pd.DataFrame(d)
df.set_index('Time_Interval', inplace=True)
I tried to pad using brute force:
from keras.preprocessing.sequence import pad_sequences
x1 = df['A'][df['ID']== 'a12']
x2 = df['A'][df['ID']== 'b33']
x3 = df['A'][df['ID']== 'v55']
mx = df['ID'].size().max() # Find the largest group
seq1 = [x1, x2, x3]
padded1 = np.array(pad_sequences(seq1, maxlen=6, dtype='float32')).reshape(-1,mx,1)
In similar ways I have created padded2, padded3 and padded4 for each feature:
padded_data = np.dstack((padded1, padded1, padded3, padded4))
padded_data.shape = (3, 6, 4)
padded_data
array([[[0. , 0. , 0. , 0. ],
[0. , 0. , 2.2, 2.4],
[0. , 0. , 2.2, 2.4],
[1. , 1. , 2.2, 2.4],
[0. , 0. , 2.2, 2.4],
[1. , 1. , 2.2, 2.4]],
[[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 3.1, 1.2],
[1. , 1. , 3.1, 1.2],
[0. , 0. , 3.1, 1.2],
[1. , 1. , 3.1, 1.2]],
[[0. , 0. , 1.5, 1.5],
[1. , 1. , 1.5, 1.5],
[1. , 1. , 1.5, 1.5],
[1. , 1. , 1.5, 1.5],
[0. , 0. , 1.5, 1.5],
[1. , 1. , 1.5, 1.5]]], dtype=float32)
edit
#split into train/test
train = pad_1[:2] # train on the 1st two samples.
test = pad_1[-1:]
train_X = train[:,:-1] # one step ahead prediction.
train_y = train[:,1:]
test_X = test[:,:-1] # test on the last sample
test_y = test[:,1:]
# check shapes
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
#(2, 5, 4) (2, 5, 4) (1, 5, 4) (1, 5, 4)
# design network
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(train.shape[1], train.shape[2])))
model.add(LSTM(32, input_shape=(train.shape[1], train.shape[2]), return_sequences=True))
model.add(Dense(4))
model.compile(loss='mae', optimizer='adam', metrics=['accuracy'])
model.summary()
# fit network
history = model.fit(train, test, epochs=300, validation_data=(test_X, test_y), verbose=2, shuffle=False)
[![enter image description here][3]][3]
So my questions are:
Surely, there must be an efficient way of transforming the data?
Say I want a single time-step prediction for future sequence, I have
first time-step = array([[[0.5 , 0.9 , 2.5, 3.5]]], dtype=float32)
Where first time-step is a single 'frame' of a sequence.
How do adjust the model to incorporate this?

To resolve the error, remove return_sequence=True from the LSTM layer arguments (since with this architecture you have defined, you only need the output of last layer) and also simply use train[:, -1] and test[:, -1] (instead of train[:, -1:] and test[:, -1:]) to extract the labels (i.e. removing : causes the second axis to be dropped and therefore makes the labels shape consistent with the output shape of the model).
As a side note, wrapping a Dense layer inside a TimeDistributed layer is redundant, since the Dense layer is applied on the last axis.
Update: As for the new question, either pad the input sequence which has only one timestep to make it have a shape of (5,4), or alternatively set the input shape of the first layer (i.e. Masking) to input_shape=(None, train.shape[2]) so the model can work with inputs of varying length.

Returning probabilities in a classification prediction in Keras?

I am trying to make a simple proof-of-concept where I can see the probabilities of different classes for a given prediction.
However, everything I try seems to only output the predicted class, even though I am using a softmax activation. I am new to machine learning, so I'm not sure if I am making a simple mistake or if this is a feature not available in Keras.
I'm using Keras + TensorFlow. I have adapted one of the basic examples given by Keras for classifying the MNIST dataset.
My code below is exactly the same as the example, except for a few (commented) extra lines that exports the model to a local file.
'''Trains a simple deep NN on the MNIST dataset.
Gets to 98.40% test accuracy after 20 epochs
(there is *a lot* of margin for parameter tuning).
2 seconds per epoch on a K520 GPU.
'''
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
import h5py # added import because it is required for model.save
model_filepath = 'test_model.h5' # added filepath config
batch_size = 128
num_classes = 10
epochs = 20
# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
optimizer=RMSprop(),
metrics=['accuracy'])
history = model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=1,
validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
model.save(model_filepath) # added saving model
print('Model saved') # added log
Then the second part of this is a simple script that should import the model, predict the class for some given data, and print out the probabilities for each class. (I am using the same mnist class included with the Keras codebase to make an example as simple as possible).
import keras
from keras.datasets import mnist
from keras.models import Sequential
import keras.backend as K
import numpy
# loading model saved locally in test_model.h5
model_filepath = 'test_model.h5'
prev_model = keras.models.load_model(model_filepath)
# these lines are copied from the example for loading MNIST data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
# for this example, I am only taking the first 10 images
x_slice = x_train[slice(1, 11, 1)]
# making the prediction
prediction = prev_model.predict(x_slice)
# logging each on a separate line
for single_prediction in prediction:
print(single_prediction)
If I run the first script to export the model, then the second script to classify some examples, I get the following output:
[ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
This is great for seeing which class each is predicted to be, but what if I want to see the relative probabilities of each class for each example? I am looking for something more like this:
[ 0.94 0.01 0.02 0. 0. 0.01 0. 0.01 0.01 0.]
[ 0. 0. 0. 0. 0.51 0. 0. 0. 0.49 0.]
...
In other words, I need to know how sure each prediction is, not just the prediction itself. I thought seeing the relative probabilities was a part of using a softmax activation in the model, but I can't seem to find anything in the Keras documentation that would give me probabilities instead of the predicted answer. Am I making some kind of silly mistake, or is this feature not available?

So it turns out that the problem was I was not fully normalizing the data in the prediction script.
My prediction script should have had the following lines:
# these lines are copied from the example for loading MNIST data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_train = x_train.astype('float32') # this line was missing
x_train /= 255 # this line was missing too
Because the data was not cast to float, and divided by 255 (so it would be between 0 and 1), it was just showing up as 1s and 0s.

Keras predict indeed returns probabilities, and not classes.
Cannot reproduce your issue with my system configuration:
Python version 2.7.12
Tensorflow version 1.3.0
Keras version 2.0.9
Numpy version 1.13.3
Here is my prediction output for your x_slice with the loaded model (trained for 20 epochs, as in your code):
print(prev_model.predict(x_slice))
# Result:
[[ 1.00000000e+00 3.31656316e-37 1.07806675e-21 7.11765177e-30
2.48000320e-31 5.34837679e-28 3.12470132e-24 4.65175406e-27
8.66994134e-31 5.26426367e-24]
[ 0.00000000e+00 5.34361977e-30 3.91144999e-35 0.00000000e+00
1.00000000e+00 0.00000000e+00 1.05583665e-36 1.01395577e-29
0.00000000e+00 1.70868685e-29]
[ 3.99137559e-38 1.00000000e+00 1.76682222e-24 9.33333581e-31
3.99846307e-15 1.17745576e-24 1.87529709e-26 2.18951752e-20
3.57518280e-17 1.62027896e-28]
[ 6.48006586e-26 1.48974980e-17 5.60530329e-22 1.81973780e-14
9.12573406e-10 1.95987500e-14 8.08566866e-27 1.17901132e-12
7.33970447e-13 1.00000000e+00]
[ 2.01602060e-16 6.58242856e-14 1.00000000e+00 6.84244084e-09
1.19809885e-16 7.94907624e-14 3.10690434e-19 8.02848586e-12
4.68330721e-11 5.14736501e-15]
[ 2.31014903e-35 1.00000000e+00 6.02224725e-21 2.35928828e-23
7.50006509e-15 4.06930881e-22 1.13288827e-24 4.20440718e-17
4.95182972e-17 1.85492109e-18]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
0.00000000e+00 6.30200370e-27 0.00000000e+00 5.19937755e-33
1.63205659e-31 1.21508034e-20]
[ 1.44608573e-26 1.00000000e+00 1.78712268e-18 6.84598301e-19
1.30042071e-11 2.53873986e-14 5.83169942e-17 1.20201071e-12
2.21844570e-14 3.75015198e-15]
[ 0.00000000e+00 6.29184453e-34 9.22474943e-29 0.00000000e+00
1.00000000e+00 3.05067233e-34 1.43097161e-28 1.34234082e-29
4.28647272e-36 9.29760838e-34]
[ 4.68828449e-30 5.55172479e-20 3.26705529e-19 9.99999881e-01
3.49577992e-22 1.27715460e-11 4.99185615e-36 1.19164204e-20
4.21086124e-16 1.52631387e-07]]
I suspect some rounding issue when printing (or you have trained for much more epochs, and your probabilities for the training set have gotten very close to 1)...
To convince yourself that you indeed get probabilities and not class predictions, I suggest to try getting predictions from your model trained for a single epoch; normally you should see much less 1.0's - here is the case here for a model trained for epochs=1:
print(model.predict(x_slice))
# Result:
[[ 9.99916673e-01 5.36548761e-08 6.10747229e-05 8.21199933e-07
6.64725164e-08 6.78853041e-07 9.09637220e-06 4.56192402e-06
1.62688798e-06 5.23997733e-06]
[ 7.59836894e-07 1.78043920e-05 1.79073555e-04 2.95592145e-05
9.98031914e-01 1.75839632e-05 5.90557102e-06 1.27705920e-03
3.94643757e-06 4.36416740e-04]
[ 4.48473330e-08 9.99895334e-01 2.82608235e-05 5.33154832e-07
9.78453227e-06 1.58954310e-06 3.38150176e-06 5.26260410e-05
8.09341054e-06 3.28643267e-07]
[ 7.38236849e-07 4.80247072e-05 2.81726116e-05 4.77648537e-05
7.21933879e-03 2.52177160e-05 3.88786475e-07 3.56770557e-04
2.83472677e-04 9.91990149e-01]
[ 5.03611082e-05 2.69402866e-04 9.92011130e-01 4.68175858e-03
9.57477605e-05 4.26214538e-04 7.66683661e-05 7.05923303e-04
1.45670515e-03 2.26032615e-04]
[ 1.36330849e-10 9.99994516e-01 7.69141934e-07 1.44130311e-07
9.52201333e-07 1.45219332e-07 4.43408908e-07 6.93398249e-07
2.18685204e-06 1.50741769e-07]
[ 2.39427478e-09 3.75754922e-07 3.89349816e-06 9.99889374e-01
1.85837867e-09 1.16176770e-05 1.89989760e-11 3.12301523e-07
1.13220040e-05 8.29571582e-05]
[ 1.45760115e-08 9.99900222e-01 3.67058942e-06 4.04857201e-06
1.97999962e-05 7.85745397e-06 8.13850420e-06 1.87294081e-05
2.81870762e-05 9.38157609e-06]
[ 7.52560858e-09 8.84437856e-09 9.71140025e-07 5.20911703e-10
9.99986649e-01 3.12135370e-07 1.06521384e-05 1.25693066e-06
7.21853368e-08 5.21001624e-08]
[ 8.67672298e-08 2.17907742e-04 2.45352840e-06 9.95455265e-01
1.43749105e-06 1.51766278e-03 1.83744309e-08 3.83995541e-07
9.90309782e-05 2.70584645e-03]]

Where is Keras LSTM Bias added at inference time?

The Keras LSTM implementation outputs kernel weights, recurrent weights and a single bias vector. I would have expected there to be a bias for both the kernel weights and the recurrent weights so I am trying to make sure that I understand where this bias is being applied. Consider the randomly initialized example:
test_model = Sequential()
test_model.add(LSTM(4,input_dim=5,input_length=10,return_sequences=True))
for e in zip(test_model.layers[0].trainable_weights, test_model.layers[0].get_weights()):
print('Param %s:\n%s' % (e[0],e[1]))
print(e[1].shape)
This will something like the following:
Param <tf.Variable 'lstm_3/kernel:0' shape=(5, 16) dtype=float32_ref>:
[[-0.46578053 -0.31746995 -0.33488223 0.4640277 -0.46431816 -0.0852727
0.43396038 0.12882692 -0.0822868 -0.23696694 0.4661569 0.4719978
0.12041456 -0.20120585 0.45095628 -0.1172519 ]
[ 0.04213512 -0.24420211 -0.33768272 0.11827284 -0.01744157 -0.09241
0.18402642 0.07530934 -0.28586367 -0.05161515 -0.18925312 -0.19212383
0.07093149 -0.14886391 -0.08835816 0.15116036]
[-0.09760407 -0.27473268 -0.29974532 -0.14995047 0.35970795 0.03962368
0.35579181 -0.21503082 -0.46921644 -0.47543833 -0.51497519 -0.08157375
0.4575423 0.35909468 -0.20627108 0.20574462]
[-0.19834137 0.05490702 0.13013887 -0.52255917 0.20565301 0.12259561
-0.33298236 0.2399289 -0.23061508 0.2385658 -0.08770937 -0.35886696
0.28242612 -0.49390298 -0.23676801 0.09713227]
[-0.21802655 -0.32708862 -0.2184104 -0.28524712 0.37784815 0.50567037
0.47393328 -0.05177036 0.41434419 -0.36551589 0.01406455 0.30521619
0.39916915 0.22952956 0.40699703 0.4528749 ]]
(5, 16)
Param <tf.Variable 'lstm_3/recurrent_kernel:0' shape=(4, 16) dtype=float32_ref>:
[[ 0.28626361 -0.21708137 -0.18340513 -0.02943563 -0.16822724 0.38830781
-0.50277489 -0.07898639 -0.30247116 -0.01375726 -0.34504923 -0.01373435
-0.32458451 -0.03497506 -0.01305341 0.28398186]
[-0.35822678 0.13861786 0.42913082 0.11312254 -0.1593778 0.58666271
0.09238213 -0.24134786 0.2196856 -0.01660753 -0.01929135 -0.02324873
-0.2000526 -0.07921806 -0.33966202 -0.08963238]
[-0.06521184 -0.28180376 0.00445012 -0.32302913 -0.02236169 -0.00901215
0.03330055 0.10727262 0.03839845 -0.58494729 0.36934188 -0.31894827
-0.43042961 0.01130622 0.11946538 -0.13160609]
[-0.31211731 -0.24986106 0.16157174 -0.27083701 0.14389414 -0.23260537
-0.28311059 -0.17966864 -0.28650531 -0.06572254 -0.03313115 0.23230191
0.13236329 0.44721091 -0.42978323 -0.09875761]]
(4, 16)
Param <tf.Variable 'lstm_3/bias:0' shape=(16,) dtype=float32_ref>:
[ 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
(16,)
I grasp that kernel weights are used for the linear transformation of the inputs so they are of shape [input_dim, 4 * hidden_units] or in this case [5, 16] and the kernel weights are used for the linear transformation of the recurrent weights so they are of shape [hidden_units, 4 * hidden_units]. The bias on the other hand is of shape [4 * hidden units] so it is conceivable that it could be added to the recurrent_weights, but not the input transformation. This example shows that the bias as it is output here can only be added to the recurrent_state:
embedding_dim = 5
hidden_units = 4
test_embedding = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
kernel_weights = test_model.layers[0].get_weights()[0]
recurrent_weights = test_model.layers[0].get_weights()[1]
bias = test_model.layers[0].get_weights()[2]
initial_state = np.zeros((hidden_units, 1))
input_transformation = np.dot(np.transpose(kernel_weights), test_embedding[0]) # + bias or + np.transpose(bias) won't work
recurrent_transformation = np.dot(np.transpose(recurrent_weights), initial_state) + bias
print(input_transformation.shape)
print(recurrent_transformation.shape)
Looking at this blog post there are biases added at pretty much every step, so I'm still feeling pretty lost as to where this bias is being applied.
Can anybody help me clarify where the LSTM bias is being added?

The bias is added to the recurrent cell after the matrix multiply. It doesn't matter whether it's added to inputs after the matmul or to the recurrent data after matmul because addition is commutative. See the LSTM equations below:

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Using Dropout on output of embedding layer changes array values, Why? - python

Related

predict() values are zero or NaN

MinMax inverse_transform after LSTM + Dense(1) layers

Many-to-many lstm model on varying samples

Returning probabilities in a classification prediction in Keras?

Where is Keras LSTM Bias added at inference time?

Categories

Resources