Observing the outputs of embedding layer with and without dropout shows that values in the arrays are replaced with 0. But along with this why other values of array changed ?
Following is my model:-
input = Input(shape=(23,))
model = Embedding(input_dim=n_words, output_dim=23, input_length=23)(input)
model = Dropout(0.2)(model)
model = Bidirectional(LSTM(units=LSTM_N, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model) # softmax output layer
model = Model(input, out)
Building model2 from trained model , with input as the input layer and output as the output of Dropout(0.2) . -
from keras import backend as K
model2 = K.function([model.layers[0].input , K.learning_phase()],
[model.layers[2].output] )
dropout = model2([X_train[0:1] , 1])[0]
nodrop = model2([X_train[0:1] , 0])[0]
Printing the first array of both dropout and no dropout:
dropout[0][0]
Output-
array([ 0. , -0. , -0. , -0.04656423, -0. ,
0.28391626, 0.12213208, -0.01187495, -0.02078421, -0. ,
0.10585815, -0. , 0.27178472, -0.21080771, 0. ,
-0.09336889, 0.07441022, 0.02960865, -0.2755439 , -0.11252255,
-0.04330419, -0. , 0.04974075], dtype=float32)
-
nodrop[0][0]
Output-
array([ 0.09657606, -0.06267098, -0.00049554, -0.03725138, -0.11286845,
0.22713302, 0.09770566, -0.00949996, -0.01662737, -0.05788678,
0.08468652, -0.22405024, 0.21742778, -0.16864617, 0.08558936,
-0.07469511, 0.05952817, 0.02368692, -0.22043513, -0.09001804,
-0.03464335, -0.05152775, 0.0397926 ], dtype=float32)
Some values are replaced with 0 , agreed, but why are other values changed ?
As the embedding outputs have a meaning and are unique for each of the words, if these are changed by applying dropout, then is it correct to apply dropout after embedding layer ?
Note- I have used "learning_phase" as 0 and 1 for testing(nodropout)
and training(droput) respectively.
It is how the dropout regularization works. After applying the dropout, the values are divided by the keeping probability (in this case 0.8).
When you use dropout, the function receives the probability of turning a neuron to zero as input, e.g., 0.2, which means it has 0.8 chance of keeping any given neuron. So, the values remaining will be multiplied by 1/(1-0.2).
This is called "inverted dropout technique" and it is done in order to ensure that the expected value of the activation remains the same. Otherwise, predictions will be wrong during inference when dropout is not used.
You'll notice that your dropout is 0.2, and all your values have been multiplied by 0.8 after you applied dropout.
Look what happens if I divide your second output bu the first:
import numpy as np
a = np.array([ 0. , -0. , -0. , -0.04656423, -0. ,
0.28391626, 0.12213208, -0.01187495, -0.02078421, -0. ,
0.10585815, -0. , 0.27178472, -0.21080771, 0. ,
-0.09336889, 0.07441022, 0.02960865, -0.2755439 , -0.11252255,
-0.04330419, -0. , 0.04974075])
b = np.array([ 0.09657606, -0.06267098, -0.00049554, -0.03725138, -0.11286845,
0.22713302, 0.09770566, -0.00949996, -0.01662737, -0.05788678,
0.08468652, -0.22405024, 0.21742778, -0.16864617, 0.08558936,
-0.07469511, 0.05952817, 0.02368692, -0.22043513, -0.09001804,
-0.03464335, -0.05152775, 0.0397926 ])
print(b/a)
[ inf inf inf 0.79999991 inf 0.80000004
0.79999997 0.8 0.8000001 inf 0.8 inf
0.80000001 0.80000001 inf 0.79999998 0.79999992 0.8
0.80000004 0.8 0.79999995 inf 0.8 ]
I'm training an agent to act in a discrete environment, and I'm using a tf.distributions.Categorical output layer which I then sample to create a softmax output to determine what action to take. I create my policy network like this:
pi_eval, _ = self._build_anet(self.state, 'pi', reuse=True)
def _build_anet(self, state_in, name, reuse=False):
w_reg = tf.contrib.layers.l2_regularizer(L2_REG)
with tf.variable_scope(name, reuse=reuse):
layer_1 = tf.layers.dense(state_in, HIDDEN_LAYER_NEURONS, tf.nn.relu, kernel_regularizer=w_reg, name="pi_l1")
layer_2 = tf.layers.dense(layer_1, HIDDEN_LAYER_NEURONS, tf.nn.relu, kernel_regularizer=w_reg, name="pi_l2")
a_logits = tf.layers.dense(layer_2, self.a_dim, kernel_regularizer=w_reg, name="pi_logits")
dist = tf.distributions.Categorical(logits=a_logits)
params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)
return dist, params
I then sample the network and build up a class distribution output to act as a softmax output, using the example from the tf.distributions.Categorical webpage:
n = 1e4
self.logits_action = tf.cast(tf.histogram_fixed_width(values=pi_eval.sample(int(n)), value_range=[0, 1], nbins=self.a_dim), dtype=tf.float32) / n
Run like this:
softmax = self.sess.run([self.logits_action], {self.state: state[np.newaxis, :]})
But the outputs only ever have two non-zero entries:
[0.44329998 0. 0. 0.5567 ]
[0.92139995 0. 0. 0.0786 ]
[0.95699996 0. 0. 0.043 ]
[0.7051 0. 0. 0.2949]
My hunch is something to do with value_range which the documentation says:
value_range: Shape 2 Tensor of same dtype as values. values <=
value_range[0] will be mapped to hist[0], values >= value_range1
will be mapped to hist[-1].
But I'm not sure what value range I should use? I wonder if anyone had any ideas?
Indeed, as I suspected it was something to do with the value_range and I should set the upper size to the action dimension:
value_range=[0, self.a_dim]
I have tried the example with keras but was not with LSTM. My model is with LSTM in Tensorflow and I am willing to predict the output in the form of classes as the keras model thus with predict_classes.
The Tensorflow model I am trying is something like this:
seq_len=10
n_steps = seq_len-1
n_inputs = x_train.shape[2]
n_neurons = 50
n_outputs = y_train.shape[1]
n_layers = 2
learning_rate = 0.0001
batch_size =100
n_epochs = 1000
train_set_size = x_train.shape[0]
test_set_size = x_test.shape[0]
tf.reset_default_graph()
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_outputs])
layers = [tf.contrib.rnn.LSTMCell(num_units=n_neurons,activation=tf.nn.sigmoid, use_peepholes = True) for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
stacked_outputs = tf.layers.dense(stacked_rnn_outputs, n_outputs)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])
outputs = outputs[:,n_steps-1,:]
loss = tf.reduce_mean(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
I am encoding the with sklearn LabelEncoder as:
encoder_train = LabelEncoder()
encoder_train.fit(y_train)
encoded_Y_train = encoder_train.transform(y_train)
y_train = np_utils.to_categorical(encoded_Y_train)
The data is converted to sparse matrix kinda thing in binary format.
When I tried to predict the output I got the following:
actual==> [[0. 0. 1.]
[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[1. 0. 0.]
[1. 0. 0.]
[1. 0. 0.]
[0. 1. 0.]
[0. 1. 0.]]
predicted==> [[0.3112209 0.3690182 0.31357136]
[0.31085992 0.36959863 0.31448898]
[0.31073445 0.3703295 0.31469804]
[0.31177694 0.37011752 0.3145326 ]
[0.31220382 0.3692756 0.31515726]
[0.31232828 0.36947766 0.3149037 ]
[0.31190437 0.36756667 0.31323162]
[0.31339088 0.36542615 0.310322 ]
[0.31598282 0.36328828 0.30711085]]
What I was expecting for the label based on the encoding done. As the Keras model thus. See the following:
predictions = model.predict_classes(X_test, verbose=True)
print("REAL VALUES:",reverse_category(Y_test,axis=1))
print("PRED VALUES:",predictions)
print("REAL COLORS:")
print(encoder.inverse_transform(reverse_category(Y_test,axis=1)))
print("PREDICTED COLORS:")
print(encoder.inverse_transform(predictions))
The output is something like the following:
REAL VALUES: [1 1 1 ... 1 2 1]
PRED VALUES: [2 1 1 ... 1 2 2]
REAL COLORS:
['ball' 'ball' 'ball' ... 'ball' 'bat' 'ball']
PREDICTED COLORS:
['bat' 'ball' 'ball' ... 'ball' 'bat' 'bat']
Kindly, let me know what I can do in the tensorflow model that will get me the result with respect to the encoding done.
I am using Tensorflow 1.12.0 and Windows 10
You are trying to map the predicted class probabilities back to class labels. Each row in the list of output predictions contains the three predicted class probabilities. Use np.argmax to obtain the one with the highest predicted probability in order to map to the predicted class label:
import numpy as np
predictions = [[0.3112209, 0.3690182, 0.31357136],
[0.31085992, 0.36959863, 0.31448898],
[0.31073445, 0.3703295, 0.31469804],
[0.31177694, 0.37011752, 0.3145326 ],
[0.31220382, 0.3692756, 0.31515726],
[0.31232828, 0.36947766, 0.3149037 ],
[0.31190437, 0.36756667, 0.31323162],
[0.31339088, 0.36542615, 0.310322 ],
[0.31598282, 0.36328828, 0.30711085]]
np.argmax(predictions, axis=1)
Gives:
array([1, 1, 1, 1, 1, 1, 1, 1, 1])
In this case, class 1 is predicted 9 times.
As noted in the comments: this is exactly what Keras does under the hood, as you'll see in the source code.
I modeled a two layered LSTM Keras model then I compared the output of the first LSTM layer with my simple python implementation of the LSTM layer by feeding in the same weights and Inputs. The results for the first sequence of a batch are similar but not same and from the second sequence the results deviates too far.
Below is my keras model:
For comparison of the Keras model I first created an intermediate layer where the intermediate layer outputs the result of the first layer with print(intermediate_output[0,0])for the first sequence and print(intermediate_output[0][1]) for the second sequence of the same batch then print(intermediate_output[0][127]) for the last sequence.
inputs = Input(shape=(128,9))
f1=LSTM((n_hidden),return_sequences=True,name='lstm1')(inputs)
f2=LSTM((n_hidden), return_sequences=False,name='lstm2')(f1)
fc=Dense(6,activation='softmax',kernel_regularizer=regularizers.l2(lambda_loss_amount),name='fc')(f2)
model2 = Model(inputs=inputs, outputs=fc)
layer_name = 'lstm1'
intermediate_layer_model = Model(inputs=model2.input,
outputs=model2.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict(X_single_sequence[0,:,:])
print(intermediate_output[0,0]) # takes input[0][9]
print(intermediate_output[0][1]) # takes input[1][9] and hidden layer output of intermediate_output[0,0]
print(intermediate_output[0][127])
Re-Implemented first layer of the same model:
I defined LSTMlayer function where it does the same computation....after that weightLSTM loads the saved weights and x_t the same input sequence and later on h_t contains outputs for the next sequence. intermediate_out is a function corresponding to that of LSTM layer.
def sigmoid(x):
return(1.0/(1.0+np.exp(-x)))
def LSTMlayer(warr,uarr, barr,x_t,h_tm1,c_tm1):
'''
c_tm1 = np.array([0,0]).reshape(1,2)
h_tm1 = np.array([0,0]).reshape(1,2)
x_t = np.array([1]).reshape(1,1)
warr.shape = (nfeature,hunits*4)
uarr.shape = (hunits,hunits*4)
barr.shape = (hunits*4,)
'''
s_t = (x_t.dot(warr) + h_tm1.dot(uarr) + barr)
hunit = uarr.shape[0]
i = sigmoid(s_t[:,:hunit])
f = sigmoid(s_t[:,1*hunit:2*hunit])
_c = np.tanh(s_t[:,2*hunit:3*hunit])
o = sigmoid(s_t[:,3*hunit:])
c_t = i*_c + f*c_tm1
h_t = o*np.tanh(c_t)
return(h_t,c_t)
weightLSTM = model2.layers[1].get_weights()
warr,uarr, barr = weightLSTM
warr.shape,uarr.shape,barr.shape
def intermediate_out(n,warr,uarr,barr,X_test):
for i in range(0, n+1):
if i==0:
c_tm1 = np.array([0]*hunits, dtype=np.float32).reshape(1,32)
h_tm1 = np.array([0]*hunits, dtype=np.float32).reshape(1,32)
h_t,ct = LSTMlayer(warr,uarr, barr,X_test[0][0:1][0:9],h_tm1,c_tm1)
else:
h_t,ct = LSTMlayer(warr,uarr, barr,X_test[0][i:i+1][0:9],h_t,ct)
return h_t
# 1st sequence
ht0 = intermediate_out(0,warr,uarr,barr,X_test)
# 2nd sequence
ht1 = intermediate_out(1,warr,uarr,barr,X_test)
# 128th sequence
ht127 = intermediate_out(127,warr,uarr,barr,X_test)
The outputs of the keras LSTM layer from print(intermediate_output[0,0]) are as follows:
array([-0.05616369, -0.02299516, -0.00801201, 0.03872827, 0.07286803,
-0.0081161 , 0.05235862, -0.02240333, 0.0533984 , -0.08501752,
-0.04866522, 0.00254417, -0.05269946, 0.05809477, -0.08961852,
0.03975506, 0.00334282, -0.02813114, 0.01677909, -0.04411673,
-0.06751891, -0.02771493, -0.03293832, 0.04311397, -0.09430656,
-0.00269871, -0.07775293, -0.11201388, -0.08271968, -0.07464679,
-0.03533605, -0.0112953 ], dtype=float32)
and the outputs of my implementation from print(ht0) are:
array([[-0.05591469, -0.02280132, -0.00797964, 0.03681555, 0.06771626,
-0.00855897, 0.05160453, -0.02309707, 0.05746563, -0.08988875,
-0.05093143, 0.00264367, -0.05087904, 0.06033305, -0.0944235 ,
0.04066657, 0.00344291, -0.02881387, 0.01696692, -0.04101779,
-0.06718517, -0.02798996, -0.0346873 , 0.04402719, -0.10021093,
-0.00276826, -0.08390114, -0.1111543 , -0.08879325, -0.07953986,
-0.03261982, -0.01175724]], dtype=float32)
The outputs from print(intermediate_output[0][1]):
array([-0.13193817, -0.03231169, -0.02096735, 0.07571879, 0.12657365,
0.00067896, 0.09008797, -0.05597101, 0.09581321, -0.1696091 ,
-0.08893952, -0.0352162 , -0.07936387, 0.11100324, -0.19354928,
0.09691346, -0.0057206 , -0.03619875, 0.05680932, -0.08598096,
-0.13047703, -0.06360915, -0.05707538, 0.09686109, -0.18573627,
0.00711019, -0.1934243 , -0.21811798, -0.15629804, -0.17204499,
-0.07108577, 0.01727455], dtype=float32)
print(ht1):
array([[-1.34333193e-01, -3.36792655e-02, -2.06091907e-02,
7.15097040e-02, 1.18231244e-01, 7.98894180e-05,
9.03479978e-02, -5.85013032e-02, 1.06357656e-01,
-1.82848617e-01, -9.50253978e-02, -3.67032290e-02,
-7.70251378e-02, 1.16113290e-01, -2.08772928e-01,
9.89214852e-02, -5.82863577e-03, -3.79538871e-02,
6.01535551e-02, -7.99121782e-02, -1.31876275e-01,
-6.66067824e-02, -6.15542643e-02, 9.91254672e-02,
-2.00229391e-01, 7.51443207e-03, -2.13641390e-01,
-2.18286291e-01, -1.70858681e-01, -1.88928470e-01,
-6.49823472e-02, 1.72227081e-02]], dtype=float32)
print(intermediate_output[0][127]):
array([-0.46212202, 0.280646 , 0.514289 , -0.21109435, 0.53513926,
0.20116206, 0.24579187, 0.10773794, -0.6350403 , -0.0052841 ,
-0.15971565, 0.00309152, 0.04909453, 0.29789132, 0.24909772,
0.12323025, 0.15282209, 0.34281147, -0.2948742 , 0.03674917,
-0.22213924, 0.17646286, -0.12948939, 0.06568322, 0.04172657,
-0.28638166, -0.29086435, -0.6872528 , -0.12620741, 0.63395363,
-0.37212485, -0.6649531 ], dtype=float32)
print(ht127):
array([[-0.47431907, 0.29702517, 0.5428258 , -0.21381126, 0.6053808 ,
0.22849198, 0.25656056, 0.10378123, -0.6960949 , -0.09966939,
-0.20533416, -0.01677105, 0.02512029, 0.37508538, 0.35703233,
0.14703275, 0.24901289, 0.35873395, -0.32249793, 0.04093777,
-0.20691746, 0.20096642, -0.11741923, 0.06169611, 0.01019177,
-0.33316574, -0.08499744, -0.6748463 , -0.06659956, 0.71961826,
-0.4071832 , -0.6804066 ]], dtype=float32)
The outputs from (print(intermediate_output[0,0]), print(h_t[0])) and (print(intermediate_output[0][1]), print(h_t1)) are similar...but the output from print(intermediate_output[0][127]) and print(h_t127) not same and both the algorithms are implemented on the same gpu...
I saw the keras documentation and to me it seems that I am not doing anything wrong....Please comment on this and let me know that what else am I missing here ??